We gratefully acknowledge support from
the Simons Foundation and member institutions.

Computer Vision and Pattern Recognition

New submissions

[ total of 297 entries: 1-297 ]
[ showing up to 2000 entries per page: fewer | more ]

New submissions for Tue, 23 Apr 24

[1]  arXiv:2404.13130 [pdf, other]
Title: On-board classification of underwater images using hybrid classical-quantum CNN based method
Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantum Physics (quant-ph)

Underwater images taken from autonomous underwater vehicles (AUV's) often suffer from low light, high turbidity, poor contrast, motion-blur and excessive light scattering and hence require image enhancement techniques for object recognition. Machine learning methods are being increasingly used for object recognition under such adverse conditions. These enhanced object recognition methods of images taken from AUV's has potential applications in underwater pipeline and optical fibre surveillance, ocean bed resource extraction, ocean floor mapping, underwater species exploration, etc. While the classical machine learning methods are very efficient in terms of accuracy, they require large datasets and high computational time for image classification. In the current work, we use quantum-classical hybrid machine learning methods for real-time under-water object recognition on-board an AUV for the first time. We use real-time motion-blurred and low-light images taken from an on-board camera of AUV built in-house and apply existing hybrid machine learning methods for object recognition. Our hybrid methods consist of quantum encoding and flattening of classical images using quantum circuits and sending them to classical neural networks for image classification. The results of hybrid methods carried out using Pennylane based quantum simulators both on GPU and using pre-trained models on an on-board NVIDIA GPU chipset are compared with results from corresponding classical machine learning methods. We observe that the hybrid quantum machine learning methods show an efficiency greater than 65\% and reduction in run-time by one-thirds and require 50\% smaller dataset sizes for training the models compared to classical machine learning methods. We hope that our work opens up further possibilities in quantum enhanced real-time computer vision in autonomous vehicles.

[2]  arXiv:2404.13148 [pdf, other]
Title: BACS: Background Aware Continual Semantic Segmentation
Comments: 8 pages, 4 figures, CRV 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Semantic segmentation plays a crucial role in enabling comprehensive scene understanding for robotic systems. However, generating annotations is challenging, requiring labels for every pixel in an image. In scenarios like autonomous driving, there's a need to progressively incorporate new classes as the operating environment of the deployed agent becomes more complex. For enhanced annotation efficiency, ideally, only pixels belonging to new classes would be annotated. This approach is known as Continual Semantic Segmentation (CSS). Besides the common problem of classical catastrophic forgetting in the continual learning setting, CSS suffers from the inherent ambiguity of the background, a phenomenon we refer to as the "background shift'', since pixels labeled as background could correspond to future classes (forward background shift) or previous classes (backward background shift). As a result, continual learning approaches tend to fail. This paper proposes a Backward Background Shift Detector (BACS) to detect previously observed classes based on their distance in the latent space from the foreground centroids of previous steps. Moreover, we propose a modified version of the cross-entropy loss function, incorporating the BACS detector to down-weight background pixels associated with formerly observed classes. To combat catastrophic forgetting, we employ masked feature distillation alongside dark experience replay. Additionally, our approach includes a transformer decoder capable of adjusting to new classes without necessitating an additional classification head. We validate BACS's superior performance over existing state-of-the-art methods on standard CSS benchmarks.

[3]  arXiv:2404.13159 [pdf, other]
Title: Equivariant Imaging for Self-supervised Hyperspectral Image Inpainting
Comments: 5 Pages, 4 Figures, 2 Tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Hyperspectral imaging (HSI) is a key technology for earth observation, surveillance, medical imaging and diagnostics, astronomy and space exploration. The conventional technology for HSI in remote sensing applications is based on the push-broom scanning approach in which the camera records the spectral image of a stripe of the scene at a time, while the image is generated by the aggregation of measurements through time. In real-world airborne and spaceborne HSI instruments, some empty stripes would appear at certain locations, because platforms do not always maintain a constant programmed attitude, or have access to accurate digital elevation maps (DEM), and the travelling track is not necessarily aligned with the hyperspectral cameras at all times. This makes the enhancement of the acquired HS images from incomplete or corrupted observations an essential task. We introduce a novel HSI inpainting algorithm here, called Hyperspectral Equivariant Imaging (Hyper-EI). Hyper-EI is a self-supervised learning-based method which does not require training on extensive datasets or access to a pre-trained model. Experimental results show that the proposed method achieves state-of-the-art inpainting performance compared to the existing methods.

[4]  arXiv:2404.13237 [pdf, other]
Title: PAFedFV: Personalized and Asynchronous Federated Learning for Finger Vein Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)

With the increasing emphasis on user privacy protection, biometric recognition based on federated learning have become the latest research hotspot. However, traditional federated learning methods cannot be directly applied to finger vein recognition, due to heterogeneity of data and open-set verification. Therefore, only a few application cases have been proposed. And these methods still have two drawbacks. (1) Uniform model results in poor performance in some clients, as the finger vein data is highly heterogeneous and non-Independently Identically Distributed (non-IID). (2) On individual client, a large amount of time is underutilized, such as the time to wait for returning model from server. To address those problems, this paper proposes a Personalized and Asynchronous Federated Learning for Finger Vein Recognition (PAFedFV) framework. PAFedFV designs personalized model aggregation method to solve the heterogeneity among non-IID data. Meanwhile, it employs an asynchronized training module for clients to utilize their waiting time. Finally, extensive experiments on six finger vein datasets are conducted. Base on these experiment results, the impact of non-IID finger vein data on performance of federated learning are analyzed, and the superiority of PAFedFV in accuracy and robustness are demonstrated.

[5]  arXiv:2404.13239 [pdf, other]
Title: Beyond Pixel-Wise Supervision for Medical Image Segmentation: From Traditional Models to Foundation Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Medical image segmentation plays an important role in many image-guided clinical approaches. However, existing segmentation algorithms mostly rely on the availability of fully annotated images with pixel-wise annotations for training, which can be both labor-intensive and expertise-demanding, especially in the medical imaging domain where only experts can provide reliable and accurate annotations. To alleviate this challenge, there has been a growing focus on developing segmentation methods that can train deep models with weak annotations, such as image-level, bounding boxes, scribbles, and points. The emergence of vision foundation models, notably the Segment Anything Model (SAM), has introduced innovative capabilities for segmentation tasks using weak annotations for promptable segmentation enabled by large-scale pre-training. Adopting foundation models together with traditional learning methods has increasingly gained recent interest research community and shown potential for real-world applications. In this paper, we present a comprehensive survey of recent progress on annotation-efficient learning for medical image segmentation utilizing weak annotations before and in the era of foundation models. Furthermore, we analyze and discuss several challenges of existing approaches, which we believe will provide valuable guidance for shaping the trajectory of foundational models to further advance the field of medical image segmentation.

[6]  arXiv:2404.13252 [pdf, other]
Title: 3D-Convolution Guided Spectral-Spatial Transformer for Hyperspectral Image Classification
Comments: Accepted in IEEE Conference on Artificial Intelligence, 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

In recent years, Vision Transformers (ViTs) have shown promising classification performance over Convolutional Neural Networks (CNNs) due to their self-attention mechanism. Many researchers have incorporated ViTs for Hyperspectral Image (HSI) classification. HSIs are characterised by narrow contiguous spectral bands, providing rich spectral data. Although ViTs excel with sequential data, they cannot extract spectral-spatial information like CNNs. Furthermore, to have high classification performance, there should be a strong interaction between the HSI token and the class (CLS) token. To solve these issues, we propose a 3D-Convolution guided Spectral-Spatial Transformer (3D-ConvSST) for HSI classification that utilizes a 3D-Convolution Guided Residual Module (CGRM) in-between encoders to "fuse" the local spatial and spectral information and to enhance the feature propagation. Furthermore, we forego the class token and instead apply Global Average Pooling, which effectively encodes more discriminative and pertinent high-level features for classification. Extensive experiments have been conducted on three public HSI datasets to show the superiority of the proposed model over state-of-the-art traditional, convolutional, and Transformer models. The code is available at https://github.com/ShyamVarahagiri/3D-ConvSST.

[7]  arXiv:2404.13263 [pdf, other]
Title: FilterPrompt: Guiding Image Transfer in Diffusion Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In controllable generation tasks, flexibly manipulating the generated images to attain a desired appearance or structure based on a single input image cue remains a critical and longstanding challenge. Achieving this requires the effective decoupling of key attributes within the input image data, aiming to get representations accurately. Previous research has predominantly concentrated on disentangling image attributes within feature space. However, the complex distribution present in real-world data often makes the application of such decoupling algorithms to other datasets challenging. Moreover, the granularity of control over feature encoding frequently fails to meet specific task requirements. Upon scrutinizing the characteristics of various generative models, we have observed that the input sensitivity and dynamic evolution properties of the diffusion model can be effectively fused with the explicit decomposition operation in pixel space. This integration enables the image processing operations performed in pixel space for a specific feature distribution of the input image, and can achieve the desired control effect in the generated results. Therefore, we propose FilterPrompt, an approach to enhance the model control effect. It can be universally applied to any diffusion model, allowing users to adjust the representation of specific image features in accordance with task requirements, thereby facilitating more precise and controllable generation outcomes. In particular, our designed experiments demonstrate that the FilterPrompt optimizes feature correlation, mitigates content conflicts during the generation process, and enhances the model's control capability.

[8]  arXiv:2404.13268 [pdf, other]
Title: Multi-Cell Decoder and Mutual Learning for Table Structure and Character Recognition
Authors: Takaya Kawakatsu
Comments: ICDAR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Extracting table contents from documents such as scientific papers and financial reports and converting them into a format that can be processed by large language models is an important task in knowledge information processing. End-to-end approaches, which recognize not only table structure but also cell contents, achieved performance comparable to state-of-the-art models using external character recognition systems, and have potential for further improvements. In addition, these models can now recognize long tables with hundreds of cells by introducing local attention. However, the models recognize table structure in one direction from the header to the footer, and cell content recognition is performed independently for each cell, so there is no opportunity to retrieve useful information from the neighbor cells. In this paper, we propose a multi-cell content decoder and bidirectional mutual learning mechanism to improve the end-to-end approach. The effectiveness is demonstrated on two large datasets, and the experimental results show comparable performance to state-of-the-art models, even for long tables with large numbers of cells.

[9]  arXiv:2404.13270 [pdf, other]
Title: StrideNET: Swin Transformer for Terrain Recognition with Dynamic Roughness Extraction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Advancements in deep learning are revolutionizing the classification of remote-sensing images. Transformer-based architectures, utilizing self-attention mechanisms, have emerged as alternatives to conventional convolution methods, enabling the capture of long-range dependencies along with global relationships in the image. Motivated by these advancements, this paper presents StrideNET, a novel dual-branch architecture designed for terrain recognition and implicit properties estimation. The terrain recognition branch utilizes the Swin Transformer, leveraging its hierarchical representation and low computational cost to efficiently capture both local and global features. The terrain properties branch focuses on the extraction of surface properties such as roughness and slipperiness using a statistical texture analysis method. By computing surface terrain properties, an enhanced environmental perception can be obtained. The StrideNET model is trained on a dataset comprising four target terrain classes: Grassy, Marshy, Sandy, and Rocky. StrideNET attains competitive performance compared to contemporary methods. The implications of this work extend to various applications, including environmental monitoring, land use and land cover (LULC) classification, disaster response, precision agriculture, and much more.

[10]  arXiv:2404.13273 [pdf, other]
Title: Multi-feature Reconstruction Network using Crossed-mask Restoration for Unsupervised Anomaly Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Unsupervised anomaly detection using only normal samples is of great significance for quality inspection in industrial manufacturing. Although existing reconstruction-based methods have achieved promising results, they still face two problems: poor distinguishable information in image reconstruction and well abnormal regeneration caused by model over-generalization ability. To overcome the above issues, we convert the image reconstruction into a combination of parallel feature restorations and propose a multi-feature reconstruction network, MFRNet, using crossed-mask restoration in this paper. Specifically, a multi-scale feature aggregator is first developed to generate more discriminative hierarchical representations of the input images from a pre-trained model. Subsequently, a crossed-mask generator is adopted to randomly cover the extracted feature map, followed by a restoration network based on the transformer structure for high-quality repair of the missing regions. Finally, a hybrid loss is equipped to guide model training and anomaly estimation, which gives consideration to both the pixel and structural similarity. Extensive experiments show that our method is highly competitive with or significantly outperforms other state-of-the-arts on four public available datasets and one self-made dataset.

[11]  arXiv:2404.13282 [pdf, other]
Title: Wills Aligner: A Robust Multi-Subject Brain Representation Learner
Comments: 15 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Decoding visual information from human brain activity has seen remarkable advancements in recent research. However, due to the significant variability in cortical parcellation and cognition patterns across subjects, current approaches personalized deep models for each subject, constraining the practicality of this technology in real-world contexts. To tackle the challenges, we introduce Wills Aligner, a robust multi-subject brain representation learner. Our Wills Aligner initially aligns different subjects' brains at the anatomical level. Subsequently, it incorporates a mixture of brain experts to learn individual cognition patterns. Additionally, it decouples the multi-subject learning task into a two-stage training, propelling the deep model and its plugin network to learn inter-subject commonality knowledge and various cognition patterns, respectively. Wills Aligner enables us to overcome anatomical differences and to efficiently leverage a single model for multi-subject brain representation learning. We meticulously evaluate the performance of our approach across coarse-grained and fine-grained visual decoding tasks. The experimental results demonstrate that our Wills Aligner achieves state-of-the-art performance.

[12]  arXiv:2404.13299 [pdf, other]
Title: PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt Condition
Comments: Published in CVPR-2024's NTIRE: New Trends in Image Restoration and Enhancement workshop and challenges
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The development of Large Language Models (LLM) and Diffusion Models brings the boom of Artificial Intelligence Generated Content (AIGC). It is essential to build an effective quality assessment framework to provide a quantifiable evaluation of different images or videos based on the AIGC technologies. The content generated by AIGC methods is driven by the crafted prompts. Therefore, it is intuitive that the prompts can also serve as the foundation of the AIGC quality assessment. This study proposes an effective AIGC quality assessment (QA) framework. First, we propose a hybrid prompt encoding method based on a dual-source CLIP (Contrastive Language-Image Pre-Training) text encoder to understand and respond to the prompt conditions. Second, we propose an ensemble-based feature mixer module to effectively blend the adapted prompt and vision features. The empirical study practices in two datasets: AIGIQA-20K (AI-Generated Image Quality Assessment database) and T2VQA-DB (Text-to-Video Quality Assessment DataBase), which validates the effectiveness of our proposed method: Prompt Condition Quality Assessment (PCQA). Our proposed simple and feasible framework may promote research development in the multimodal generation field.

[13]  arXiv:2404.13306 [pdf, other]
Title: FakeBench: Uncover the Achilles' Heels of Fake Images with Large Multimodal Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Recently, fake images generated by artificial intelligence (AI) models have become indistinguishable from the real, exerting new challenges for fake image detection models. To this extent, simple binary judgments of real or fake seem less convincing and credible due to the absence of human-understandable explanations. Fortunately, Large Multimodal Models (LMMs) bring possibilities to materialize the judgment process while their performance remains undetermined. Therefore, we propose FakeBench, the first-of-a-kind benchmark towards transparent defake, consisting of fake images with human language descriptions on forgery signs. FakeBench gropes for two open questions of LMMs: (1) can LMMs distinguish fake images generated by AI, and (2) how do LMMs distinguish fake images? In specific, we construct the FakeClass dataset with 6k diverse-sourced fake and real images, each equipped with a Question&Answer pair concerning the authenticity of images, which are utilized to benchmark the detection ability. To examine the reasoning and interpretation abilities of LMMs, we present the FakeClue dataset, consisting of 15k pieces of descriptions on the telltale clues revealing the falsification of fake images. Besides, we construct the FakeQA to measure the LMMs' open-question answering ability on fine-grained authenticity-relevant aspects. Our experimental results discover that current LMMs possess moderate identification ability, preliminary interpretation and reasoning ability, and passable open-question answering ability for image defake. The FakeBench will be made publicly available soon.

[14]  arXiv:2404.13311 [pdf, other]
Title: STAT: Towards Generalizable Temporal Action Localization
Comments: 14 pages, LaTeX;
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Weakly-supervised temporal action localization (WTAL) aims to recognize and localize action instances with only video-level labels. Despite the significant progress, existing methods suffer from severe performance degradation when transferring to different distributions and thus may hardly adapt to real-world scenarios . To address this problem, we propose the Generalizable Temporal Action Localization task (GTAL), which focuses on improving the generalizability of action localization methods. We observed that the performance decline can be primarily attributed to the lack of generalizability to different action scales. To address this problem, we propose STAT (Self-supervised Temporal Adaptive Teacher), which leverages a teacher-student structure for iterative refinement. Our STAT features a refinement module and an alignment module. The former iteratively refines the model's output by leveraging contextual information and helps adapt to the target scale. The latter improves the refinement process by promoting a consensus between student and teacher models. We conduct extensive experiments on three datasets, THUMOS14, ActivityNet1.2, and HACS, and the results show that our method significantly improves the Baseline methods under the cross-distribution evaluation setting, even approaching the same-distribution evaluation performance.

[15]  arXiv:2404.13320 [pdf, other]
Title: Pixel is a Barrier: Diffusion Models Are More Adversarially Robust Than We Think
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Adversarial examples for diffusion models are widely used as solutions for safety concerns. By adding adversarial perturbations to personal images, attackers can not edit or imitate them easily. However, it is essential to note that all these protections target the latent diffusion model (LDMs), the adversarial examples for diffusion models in the pixel space (PDMs) are largely overlooked. This may mislead us to think that the diffusion models are vulnerable to adversarial attacks like most deep models. In this paper, we show novel findings that: even though gradient-based white-box attacks can be used to attack the LDMs, they fail to attack PDMs. This finding is supported by extensive experiments of almost a wide range of attacking methods on various PDMs and LDMs with different model structures, which means diffusion models are indeed much more robust against adversarial attacks. We also find that PDMs can be used as an off-the-shelf purifier to effectively remove the adversarial patterns that were generated on LDMs to protect the images, which means that most protection methods nowadays, to some extent, cannot protect our images from malicious attacks. We hope that our insights will inspire the community to rethink the adversarial samples for diffusion models as protection methods and move forward to more effective protection. Codes are available in https://github.com/xavihart/PDM-Pure.

[16]  arXiv:2404.13324 [pdf, other]
Title: Collaborative Visual Place Recognition through Federated Learning
Comments: 13 pages, 7 figures, CVPR - The 3rd International Workshop on Federated Learning for Computer Vision (FedVision-2024)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Visual Place Recognition (VPR) aims to estimate the location of an image by treating it as a retrieval problem. VPR uses a database of geo-tagged images and leverages deep neural networks to extract a global representation, called descriptor, from each image. While the training data for VPR models often originates from diverse, geographically scattered sources (geo-tagged images), the training process itself is typically assumed to be centralized. This research revisits the task of VPR through the lens of Federated Learning (FL), addressing several key challenges associated with this adaptation. VPR data inherently lacks well-defined classes, and models are typically trained using contrastive learning, which necessitates a data mining step on a centralized database. Additionally, client devices in federated systems can be highly heterogeneous in terms of their processing capabilities. The proposed FedVPR framework not only presents a novel approach for VPR but also introduces a new, challenging, and realistic task for FL research, paving the way to other image retrieval tasks in FL.

[17]  arXiv:2404.13342 [pdf, other]
Title: Hyperspectral Anomaly Detection with Self-Supervised Anomaly Prior
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

The majority of existing hyperspectral anomaly detection (HAD) methods use the low-rank representation (LRR) model to separate the background and anomaly components, where the anomaly component is optimized by handcrafted sparse priors (e.g., $\ell_{2,1}$-norm). However, this may not be ideal since they overlook the spatial structure present in anomalies and make the detection result largely dependent on manually set sparsity. To tackle these problems, we redefine the optimization criterion for the anomaly component in the LRR model with a self-supervised network called self-supervised anomaly prior (SAP). This prior is obtained by the pretext task of self-supervised learning, which is customized to learn the characteristics of hyperspectral anomalies. Specifically, this pretext task is a classification task to distinguish the original hyperspectral image (HSI) and the pseudo-anomaly HSI, where the pseudo-anomaly is generated from the original HSI and designed as a prism with arbitrary polygon bases and arbitrary spectral bands. In addition, a dual-purified strategy is proposed to provide a more refined background representation with an enriched background dictionary, facilitating the separation of anomalies from complex backgrounds. Extensive experiments on various hyperspectral datasets demonstrate that the proposed SAP offers a more accurate and interpretable solution than other advanced HAD methods.

[18]  arXiv:2404.13353 [pdf, other]
Title: Generating Daylight-driven Architectural Design via Diffusion Models
Comments: Project page: this https URL
Journal-ref: CVPR 2024 Workshop
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In recent years, the rapid development of large-scale models has made new possibilities for interdisciplinary fields such as architecture. In this paper, we present a novel daylight-driven AI-aided architectural design method. Firstly, we formulate a method for generating massing models, producing architectural massing models using random parameters quickly. Subsequently, we integrate a daylight-driven facade design strategy, accurately determining window layouts and applying them to the massing models. Finally, we seamlessly combine a large-scale language model with a text-to-image model, enhancing the efficiency of generating visual architectural design renderings. Experimental results demonstrate that our approach supports architects' creative inspirations and pioneers novel avenues for architectural design development. Project page: https://zrealli.github.io/DDADesign/.

[19]  arXiv:2404.13370 [pdf, other]
Title: Movie101v2: Improved Movie Narration Benchmark
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)

Automatic movie narration targets at creating video-aligned plot descriptions to assist visually impaired audiences. It differs from standard video captioning in that it requires not only describing key visual details but also inferring the plots developed across multiple movie shots, thus posing unique and ongoing challenges. To advance the development of automatic movie narrating systems, we first revisit the limitations of existing datasets and develop a large-scale, bilingual movie narration dataset, Movie101v2. Second, taking into account the essential difficulties in achieving applicable movie narration, we break the long-term goal into three progressive stages and tentatively focus on the initial stages featuring understanding within individual clips. We also introduce a new narration assessment to align with our staged task goals. Third, using our new dataset, we baseline several leading large vision-language models, including GPT-4V, and conduct in-depth investigations into the challenges current models face for movie narration generation. Our findings reveal that achieving applicable movie narration generation is a fascinating goal that requires thorough research.

[20]  arXiv:2404.13400 [pdf, other]
Title: HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding
Comments: The project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works utilized uni-modal pre-trained models to transfer visual/linguistic knowledge separately while ignoring the multimodal corresponding information. Motivated by recent advancements in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exists significant task gaps between pre-training and grounding. Therefore, to address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (Hi LoRA) paradigm. The cross-modal bridge can address the inconsistency between visual features and those required for grounding, and establish a connection between multi-level visual and text features. Hi LoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase the significant grounding capabilities as well as promising energy efficiency advantages. The project page: https://github.com/linhuixiao/HiVG.

[21]  arXiv:2404.13408 [pdf, other]
Title: AMMUNet: Multi-Scale Attention Map Merging for Remote Sensing Image Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The advancement of deep learning has driven notable progress in remote sensing semantic segmentation. Attention mechanisms, while enabling global modeling and utilizing contextual information, face challenges of high computational costs and require window-based operations that weaken capturing long-range dependencies, hindering their effectiveness for remote sensing image processing. In this letter, we propose AMMUNet, a UNet-based framework that employs multi-scale attention map merging, comprising two key innovations: the granular multi-head self-attention (GMSA) module and the attention map merging mechanism (AMMM). GMSA efficiently acquires global information while substantially mitigating computational costs in contrast to global multi-head self-attention mechanism. This is accomplished through the strategic utilization of dimension correspondence to align granularity and the reduction of relative position bias parameters, thereby optimizing computational efficiency. The proposed AMMM effectively combines multi-scale attention maps into a unified representation using a fixed mask template, enabling the modeling of global attention mechanism. Experimental evaluations highlight the superior performance of our approach, achieving remarkable mean intersection over union (mIoU) scores of 75.48\% on the challenging Vaihingen dataset and an exceptional 77.90\% on the Potsdam dataset, demonstrating the superiority of our method in precise remote sensing semantic segmentation. Codes are available at https://github.com/interpretty/AMMUNet.

[22]  arXiv:2404.13417 [pdf, other]
Title: Efficient and Concise Explanations for Object Detection with Gaussian-Class Activation Mapping Explainer
Comments: Canadian AI 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

To address the challenges of providing quick and plausible explanations in Explainable AI (XAI) for object detection models, we introduce the Gaussian Class Activation Mapping Explainer (G-CAME). Our method efficiently generates concise saliency maps by utilizing activation maps from selected layers and applying a Gaussian kernel to emphasize critical image regions for the predicted object. Compared with other Region-based approaches, G-CAME significantly reduces explanation time to 0.5 seconds without compromising the quality. Our evaluation of G-CAME, using Faster-RCNN and YOLOX on the MS-COCO 2017 dataset, demonstrates its ability to offer highly plausible and faithful explanations, especially in reducing the bias on tiny object detection.

[23]  arXiv:2404.13420 [pdf, other]
Title: NeurCADRecon: Neural Representation for Reconstructing CAD Surfaces by Enforcing Zero Gaussian Curvature
Comments: ACM Transactions on Graphics (SIGGRAPH 2024)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Despite recent advances in reconstructing an organic model with the neural signed distance function (SDF), the high-fidelity reconstruction of a CAD model directly from low-quality unoriented point clouds remains a significant challenge. In this paper, we address this challenge based on the prior observation that the surface of a CAD model is generally composed of piecewise surface patches, each approximately developable even around the feature line. Our approach, named NeurCADRecon, is self-supervised, and its loss includes a developability term to encourage the Gaussian curvature toward 0 while ensuring fidelity to the input points. Noticing that the Gaussian curvature is non-zero at tip points, we introduce a double-trough curve to tolerate the existence of these tip points. Furthermore, we develop a dynamic sampling strategy to deal with situations where the given points are incomplete or too sparse. Since our resulting neural SDFs can clearly manifest sharp feature points/lines, one can easily extract the feature-aligned triangle mesh from the SDF and then decompose it into smooth surface patches, greatly reducing the difficulty of recovering the parametric CAD design. A comprehensive comparison with existing state-of-the-art methods shows the significant advantage of our approach in reconstructing faithful CAD shapes.

[24]  arXiv:2404.13425 [pdf, other]
Title: AdvLoRA: Adversarial Low-Rank Adaptation of Vision-Language Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Vision-Language Models (VLMs) are a significant technique for Artificial General Intelligence (AGI). With the fast growth of AGI, the security problem become one of the most important challenges for VLMs. In this paper, through extensive experiments, we demonstrate the vulnerability of the conventional adaptation methods for VLMs, which may bring significant security risks. In addition, as the size of the VLMs increases, performing conventional adversarial adaptation techniques on VLMs results in high computational costs. To solve these problems, we propose a parameter-efficient \underline{Adv}ersarial adaptation method named \underline{AdvLoRA} by \underline{Lo}w-\underline{R}ank \underline{A}daptation. At first, we investigate and reveal the intrinsic low-rank property during the adversarial adaptation for VLMs. Different from LoRA, we improve the efficiency and robustness of adversarial adaptation by designing a novel reparameterizing method based on parameter clustering and parameter alignment. In addition, an adaptive parameter update strategy is proposed to further improve the robustness. By these settings, our proposed AdvLoRA alleviates the model security and high resource waste problems. Extensive experiments demonstrate the effectiveness and efficiency of the AdvLoRA.

[25]  arXiv:2404.13434 [pdf, other]
Title: Nested-TNT: Hierarchical Vision Transformers with Multi-Scale Feature Processing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Transformer has been applied in the field of computer vision due to its excellent performance in natural language processing, surpassing traditional convolutional neural networks and achieving new state-of-the-art. ViT divides an image into several local patches, known as "visual sentences". However, the information contained in the image is vast and complex, and focusing only on the features at the "visual sentence" level is not enough. The features between local patches should also be taken into consideration. In order to achieve further improvement, the TNT model is proposed, whose algorithm further divides the image into smaller patches, namely "visual words," achieving more accurate results. The core of Transformer is the Multi-Head Attention mechanism, and traditional attention mechanisms ignore interactions across different attention heads. In order to reduce redundancy and improve utilization, we introduce the nested algorithm and apply the Nested-TNT to image classification tasks. The experiment confirms that the proposed model has achieved better classification performance over ViT and TNT, exceeding 2.25%, 1.1% on dataset CIFAR10 and 2.78%, 0.25% on dataset FLOWERS102 respectively.

[26]  arXiv:2404.13437 [pdf, other]
Title: High-fidelity Endoscopic Image Synthesis by Utilizing Depth-guided Neural Surfaces
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In surgical oncology, screening colonoscopy plays a pivotal role in providing diagnostic assistance, such as biopsy, and facilitating surgical navigation, particularly in polyp detection. Computer-assisted endoscopic surgery has recently gained attention and amalgamated various 3D computer vision techniques, including camera localization, depth estimation, surface reconstruction, etc. Neural Radiance Fields (NeRFs) and Neural Implicit Surfaces (NeuS) have emerged as promising methodologies for deriving accurate 3D surface models from sets of registered images, addressing the limitations of existing colon reconstruction approaches stemming from constrained camera movement.
However, the inadequate tissue texture representation and confused scale problem in monocular colonoscopic image reconstruction still impede the progress of the final rendering results. In this paper, we introduce a novel method for colon section reconstruction by leveraging NeuS applied to endoscopic images, supplemented by a single frame of depth map. Notably, we pioneered the exploration of utilizing only one frame depth map in photorealistic reconstruction and neural rendering applications while this single depth map can be easily obtainable from other monocular depth estimation networks with an object scale. Through rigorous experimentation and validation on phantom imagery, our approach demonstrates exceptional accuracy in completely rendering colon sections, even capturing unseen portions of the surface. This breakthrough opens avenues for achieving stable and consistently scaled reconstructions, promising enhanced quality in cancer screening procedures and treatment interventions.

[27]  arXiv:2404.13443 [pdf, other]
Title: FisheyeDetNet: Object Detection on Fisheye Surround View Camera Systems for Automated Driving
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Object detection is a mature problem in autonomous driving with pedestrian detection being one of the first deployed algorithms. It has been comprehensively studied in the literature. However, object detection is relatively less explored for fisheye cameras used for surround-view near field sensing. The standard bounding box representation fails in fisheye cameras due to heavy radial distortion, particularly in the periphery. To mitigate this, we explore extending the standard object detection output representation of bounding box. We design rotated bounding boxes, ellipse, generic polygon as polar arc/angle representations and define an instance segmentation mIOU metric to analyze these representations. The proposed model FisheyeDetNet with polygon outperforms others and achieves a mAP score of 49.5 % on Valeo fisheye surround-view dataset for automated driving applications. This dataset has 60K images captured from 4 surround-view cameras across Europe, North America and Asia. To the best of our knowledge, this is the first detailed study on object detection on fisheye cameras for autonomous driving scenarios.

[28]  arXiv:2404.13445 [pdf, other]
Title: DMesh: A Differentiable Representation for General Meshes
Comments: 17 pages, 9 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

We present a differentiable representation, DMesh, for general 3D triangular meshes. DMesh considers both the geometry and connectivity information of a mesh. In our design, we first get a set of convex tetrahedra that compactly tessellates the domain based on Weighted Delaunay Triangulation (WDT), and formulate probability of faces to exist on our desired mesh in a differentiable manner based on the WDT. This enables DMesh to represent meshes of various topology in a differentiable way, and allows us to reconstruct the mesh under various observations, such as point cloud and multi-view images using gradient-based optimization. The source code and full paper is available at: https://sonsang.github.io/dmesh-project.

[29]  arXiv:2404.13449 [pdf, other]
Title: SiNC+: Adaptive Camera-Based Vitals with Unsupervised Learning of Periodic Signals
Comments: Extension of CVPR2023 highlight paper. arXiv admin note: substantial text overlap with arXiv:2303.07944
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Subtle periodic signals, such as blood volume pulse and respiration, can be extracted from RGB video, enabling noncontact health monitoring at low cost. Advancements in remote pulse estimation -- or remote photoplethysmography (rPPG) -- are currently driven by deep learning solutions. However, modern approaches are trained and evaluated on benchmark datasets with ground truth from contact-PPG sensors. We present the first non-contrastive unsupervised learning framework for signal regression to mitigate the need for labelled video data. With minimal assumptions of periodicity and finite bandwidth, our approach discovers the blood volume pulse directly from unlabelled videos. We find that encouraging sparse power spectra within normal physiological bandlimits and variance over batches of power spectra is sufficient for learning visual features of periodic signals. We perform the first experiments utilizing unlabelled video data not specifically created for rPPG to train robust pulse rate estimators. Given the limited inductive biases, we successfully applied the same approach to camera-based respiration by changing the bandlimits of the target signal. This shows that the approach is general enough for unsupervised learning of bandlimited quasi-periodic signals from different domains. Furthermore, we show that the framework is effective for finetuning models on unlabelled video from a single subject, allowing for personalized and adaptive signal regressors.

[30]  arXiv:2404.13493 [pdf, other]
Title: Authentic Emotion Mapping: Benchmarking Facial Expressions in Real News
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we present a novel benchmark for Emotion Recognition using facial landmarks extracted from realistic news videos. Traditional methods relying on RGB images are resource-intensive, whereas our approach with Facial Landmark Emotion Recognition (FLER) offers a simplified yet effective alternative. By leveraging Graph Neural Networks (GNNs) to analyze the geometric and spatial relationships of facial landmarks, our method enhances the understanding and accuracy of emotion recognition. We discuss the advancements and challenges in deep learning techniques for emotion recognition, particularly focusing on Graph Neural Networks (GNNs) and Transformers. Our experimental results demonstrate the viability and potential of our dataset as a benchmark, setting a new direction for future research in emotion recognition technologies. The codes and models are at: https://github.com/wangzhifengharrison/benchmark_real_news

[31]  arXiv:2404.13505 [pdf, other]
Title: Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Conventional video object segmentation (VOS) methods usually necessitate a substantial volume of pixel-level annotated video data for fully supervised learning. In this paper, we present HVC, a \textbf{h}ybrid static-dynamic \textbf{v}isual \textbf{c}orrespondence framework for self-supervised VOS. HVC extracts pseudo-dynamic signals from static images, enabling an efficient and scalable VOS model. Our approach utilizes a minimalist fully-convolutional architecture to capture static-dynamic visual correspondence in image-cropped views. To achieve this objective, we present a unified self-supervised approach to learn visual representations of static-dynamic feature similarity. Firstly, we establish static correspondence by utilizing a priori coordinate information between cropped views to guide the formation of consistent static feature representations. Subsequently, we devise a concise convolutional layer to capture the forward / backward pseudo-dynamic signals between two views, serving as cues for dynamic representations. Finally, we propose a hybrid visual correspondence loss to learn joint static and dynamic consistency representations. Our approach, without bells and whistles, necessitates only one training session using static image data, significantly reducing memory consumption ($\sim$16GB) and training time ($\sim$\textbf{2h}). Moreover, HVC achieves state-of-the-art performance in several self-supervised VOS benchmarks and additional video label propagation tasks.

[32]  arXiv:2404.13530 [pdf, other]
Title: Listen Then See: Video Alignment with Speaker Attention
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Video-based Question Answering (Video QA) is a challenging task and becomes even more intricate when addressing Socially Intelligent Question Answering (SIQA). SIQA requires context understanding, temporal reasoning, and the integration of multimodal information, but in addition, it requires processing nuanced human behavior. Furthermore, the complexities involved are exacerbated by the dominance of the primary modality (text) over the others. Thus, there is a need to help the task's secondary modalities to work in tandem with the primary modality. In this work, we introduce a cross-modal alignment and subsequent representation fusion approach that achieves state-of-the-art results (82.06\% accuracy) on the Social IQ 2.0 dataset for SIQA. Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge with the language modality. This leads to enhanced performance by reducing the prevalent issue of language overfitting and resultant video modality bypassing encountered by current existing techniques. Our code and models are publicly available at https://github.com/sts-vlcc/sts-vlcc

[33]  arXiv:2404.13534 [pdf, other]
Title: Motion-aware Latent Diffusion Models for Video Frame Interpolation
Comments: arXiv admin note: substantial text overlap with arXiv:2303.09508 by other authors
Subjects: Computer Vision and Pattern Recognition (cs.CV)

With the advancement of AIGC, video frame interpolation (VFI) has become a crucial component in existing video generation frameworks, attracting widespread research interest. For the VFI task, the motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity. However, existing VFI methods always struggle to accurately predict the motion information between consecutive frames, and this imprecise estimation leads to blurred and visually incoherent interpolated frames. In this paper, we propose a novel diffusion framework, motion-aware latent diffusion models (MADiff), which is specifically designed for the VFI task. By incorporating motion priors between the conditional neighboring frames with the target interpolated frame predicted throughout the diffusion sampling procedure, MADiff progressively refines the intermediate outcomes, culminating in generating both visually smooth and realistic results. Extensive experiments conducted on benchmark datasets demonstrate that our method achieves state-of-the-art performance significantly outperforming existing approaches, especially under challenging scenarios involving dynamic textures with complex motion.

[34]  arXiv:2404.13541 [pdf, other]
Title: Generalizable Novel-View Synthesis using a Stereo Camera
Comments: Accepted to CVPR 2024. Project page URL: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we propose the first generalizable view synthesis approach that specifically targets multi-view stereo-camera images. Since recent stereo matching has demonstrated accurate geometry prediction, we introduce stereo matching into novel-view synthesis for high-quality geometry reconstruction. To this end, this paper proposes a novel framework, dubbed StereoNeRF, which integrates stereo matching into a NeRF-based generalizable view synthesis approach. StereoNeRF is equipped with three key components to effectively exploit stereo matching in novel-view synthesis: a stereo feature extractor, a depth-guided plane-sweeping, and a stereo depth loss. Moreover, we propose the StereoNVS dataset, the first multi-view dataset of stereo-camera images, encompassing a wide variety of both real and synthetic scenes. Our experimental results demonstrate that StereoNeRF surpasses previous approaches in generalizable view synthesis.

[35]  arXiv:2404.13550 [pdf, other]
Title: Pointsoup: High-Performance and Extremely Low-Decoding-Latency Learned Geometry Codec for Large-Scale Point Cloud Scenes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Despite considerable progress being achieved in point cloud geometry compression, there still remains a challenge in effectively compressing large-scale scenes with sparse surfaces. Another key challenge lies in reducing decoding latency, a crucial requirement in real-world application. In this paper, we propose Pointsoup, an efficient learning-based geometry codec that attains high-performance and extremely low-decoding-latency simultaneously. Inspired by conventional Trisoup codec, a point model-based strategy is devised to characterize local surfaces. Specifically, skin features are embedded from local windows via an attention-based encoder, and dilated windows are introduced as cross-scale priors to infer the distribution of quantized features in parallel. During decoding, features undergo fast refinement, followed by a folding-based point generator that reconstructs point coordinates with fairly fast speed. Experiments show that Pointsoup achieves state-of-the-art performance on multiple benchmarks with significantly lower decoding complexity, i.e., up to 90$\sim$160$\times$ faster than the G-PCCv23 Trisoup decoder on a comparatively low-end platform (e.g., one RTX 2080Ti). Furthermore, it offers variable-rate control with a single neural model (2.9MB), which is attractive for industrial practitioners.

[36]  arXiv:2404.13555 [pdf, other]
Title: Cell Phone Image-Based Persian Rice Detection and Classification Using Deep Learning Techniques
Comments: 7 pages, 4 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

This study introduces an innovative approach to classifying various types of Persian rice using image-based deep learning techniques, highlighting the practical application of everyday technology in food categorization. Recognizing the diversity of Persian rice and its culinary significance, we leveraged the capabilities of convolutional neural networks (CNNs), specifically by fine-tuning a ResNet model for accurate identification of different rice varieties and employing a U-Net architecture for precise segmentation of rice grains in bulk images. This dual-methodology framework allows for both individual grain classification and comprehensive analysis of bulk rice samples, addressing two crucial aspects of rice quality assessment. Utilizing images captured with consumer-grade cell phones reflects a realistic scenario in which individuals can leverage this technology for assistance with grocery shopping and meal preparation. The dataset, comprising various rice types photographed under natural conditions without professional lighting or equipment, presents a challenging yet practical classification problem. Our findings demonstrate the feasibility of using non-professional images for food classification and the potential of deep learning models, like ResNet and U-Net, to adapt to the nuances of everyday objects and textures. This study contributes to the field by providing insights into the applicability of image-based deep learning in daily life, specifically for enhancing consumer experiences and knowledge in food selection. Furthermore, it opens avenues for extending this approach to other food categories and practical applications, emphasizing the role of accessible technology in bridging the gap between sophisticated computational methods and everyday tasks.

[37]  arXiv:2404.13558 [pdf, other]
Title: LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation
Comments: 10 pages, 7 figures, submitted to ACMMM 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Revolutionary advancements in text-to-image models have unlocked new dimensions for sophisticated content creation, e.g., text-conditioned image editing, allowing us to edit the diverse images that convey highly complex visual concepts according to the textual guidance. Despite being promising, existing methods focus on texture- or non-rigid-based visual manipulation, which struggles to produce the fine-grained animation of smooth text-conditioned image morphing without fine-tuning, i.e., due to their highly unstructured latent space. In this paper, we introduce a tuning-free LLM-driven attention control framework, encapsulated by the progressive process of LLM planning, prompt-Aware editing, StablE animation geneRation, abbreviated as LASER. LASER employs a large language model (LLM) to refine coarse descriptions into detailed prompts, guiding pre-trained text-to-image models for subsequent image generation. We manipulate the model's spatial features and self-attention mechanisms to maintain animation integrity and enable seamless morphing directly from text prompts, eliminating the need for additional fine-tuning or annotations. Our meticulous control over spatial features and self-attention ensures structural consistency in the images. This paper presents a novel framework integrating LLMs with text-to-image models to create high-quality animations from a single text input. We also propose a Text-conditioned Image-to-Animation Benchmark to validate the effectiveness and efficacy of LASER. Extensive experiments demonstrate that LASER produces impressive, consistent, and efficient results in animation generation, positioning it as a powerful tool for advanced digital content creation.

[38]  arXiv:2404.13564 [pdf, other]
Title: Masked Latent Transformer with the Random Masking Ratio to Advance the Diagnosis of Dental Fluorosis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Dental fluorosis is a chronic disease caused by long-term overconsumption of fluoride, which leads to changes in the appearance of tooth enamel. It is an important basis for early non-invasive diagnosis of endemic fluorosis. However, even dental professionals may not be able to accurately distinguish dental fluorosis and its severity based on tooth images. Currently, there is still a gap in research on applying deep learning to diagnosing dental fluorosis. Therefore, we construct the first open-source dental fluorosis image dataset (DFID), laying the foundation for deep learning research in this field. To advance the diagnosis of dental fluorosis, we propose a pioneering deep learning model called masked latent transformer with the random masking ratio (MLTrMR). MLTrMR introduces a mask latent modeling scheme based on Vision Transformer to enhance contextual learning of dental fluorosis lesion characteristics. Consisting of a latent embedder, encoder, and decoder, MLTrMR employs the latent embedder to extract latent tokens from the original image, whereas the encoder and decoder comprising the latent transformer (LT) block are used to process unmasked tokens and predict masked tokens, respectively. To mitigate the lack of inductive bias in Vision Transformer, which may result in performance degradation, the LT block introduces latent tokens to enhance the learning capacity of latent lesion features. Furthermore, we design an auxiliary loss function to constrain the parameter update direction of the model. MLTrMR achieves 80.19% accuracy, 75.79% F1, and 81.28% quadratic weighted kappa on DFID, making it state-of-the-art (SOTA).

[39]  arXiv:2404.13565 [pdf, other]
Title: Exploring Diverse Methods in Visual Question Answering
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

This study explores innovative methods for improving Visual Question Answering (VQA) using Generative Adversarial Networks (GANs), autoencoders, and attention mechanisms. Leveraging a balanced VQA dataset, we investigate three distinct strategies. Firstly, GAN-based approaches aim to generate answer embeddings conditioned on image and question inputs, showing potential but struggling with more complex tasks. Secondly, autoencoder-based techniques focus on learning optimal embeddings for questions and images, achieving comparable results with GAN due to better ability on complex questions. Lastly, attention mechanisms, incorporating Multimodal Compact Bilinear pooling (MCB), address language priors and attention modeling, albeit with a complexity-performance trade-off. This study underscores the challenges and opportunities in VQA and suggests avenues for future research, including alternative GAN formulations and attentional mechanisms.

[40]  arXiv:2404.13573 [pdf, other]
Title: Exploring AIGC Video Quality: A Focus on Visual Harmony, Video-Text Consistency and Domain Distribution Gap
Comments: 9 pages, 3 figures, 3 tables. Accepted by CVPR2024 Workshop (3rd place of NTIRE2024 Quality Assessment for AI-Generated Content - Track 2 Video)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The recent advancements in Text-to-Video Artificial Intelligence Generated Content (AIGC) have been remarkable. Compared with traditional videos, the assessment of AIGC videos encounters various challenges: visual inconsistency that defy common sense, discrepancies between content and the textual prompt, and distribution gap between various generative models, etc. Target at these challenges, in this work, we categorize the assessment of AIGC video quality into three dimensions: visual harmony, video-text consistency, and domain distribution gap. For each dimension, we design specific modules to provide a comprehensive quality assessment of AIGC videos. Furthermore, our research identifies significant variations in visual quality, fluidity, and style among videos generated by different text-to-video models. Predicting the source generative model can make the AIGC video features more discriminative, which enhances the quality assessment performance. The proposed method was used in the third-place winner of the NTIRE 2024 Quality Assessment for AI-Generated Content - Track 2 Video, demonstrating its effectiveness.

[41]  arXiv:2404.13576 [pdf, other]
Title: I2CANSAY:Inter-Class Analogical Augmentation and Intra-Class Significance Analysis for Non-Exemplar Online Task-Free Continual Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Online task-free continual learning (OTFCL) is a more challenging variant of continual learning which emphasizes the gradual shift of task boundaries and learns in an online mode. Existing methods rely on a memory buffer composed of old samples to prevent forgetting. However,the use of memory buffers not only raises privacy concerns but also hinders the efficient learning of new samples. To address this problem, we propose a novel framework called I2CANSAY that gets rid of the dependence on memory buffers and efficiently learns the knowledge of new data from one-shot samples. Concretely, our framework comprises two main modules. Firstly, the Inter-Class Analogical Augmentation (ICAN) module generates diverse pseudo-features for old classes based on the inter-class analogy of feature distributions for different new classes, serving as a substitute for the memory buffer. Secondly, the Intra-Class Significance Analysis (ISAY) module analyzes the significance of attributes for each class via its distribution standard deviation, and generates the importance vector as a correction bias for the linear classifier, thereby enhancing the capability of learning from new samples. We run our experiments on four popular image classification datasets: CoRe50, CIFAR-10, CIFAR-100, and CUB-200, our approach outperforms the prior state-of-the-art by a large margin.

[42]  arXiv:2404.13579 [pdf, other]
Title: LTOS: Layout-controllable Text-Object Synthesis via Adaptive Cross-attention Fusions
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Controllable text-to-image generation synthesizes visual text and objects in images with certain conditions, which are frequently applied to emoji and poster generation. Visual text rendering and layout-to-image generation tasks have been popular in controllable text-to-image generation. However, each of these tasks typically focuses on single modality generation or rendering, leaving yet-to-be-bridged gaps between the approaches correspondingly designed for each of the tasks. In this paper, we combine text rendering and layout-to-image generation tasks into a single task: layout-controllable text-object synthesis (LTOS) task, aiming at synthesizing images with object and visual text based on predefined object layout and text contents. As compliant datasets are not readily available for our LTOS task, we construct a layout-aware text-object synthesis dataset, containing elaborate well-aligned labels of visual text and object information. Based on the dataset, we propose a layout-controllable text-object adaptive fusion (TOF) framework, which generates images with clear, legible visual text and plausible objects. We construct a visual-text rendering module to synthesize text and employ an object-layout control module to generate objects while integrating the two modules to harmoniously generate and integrate text content and objects in images. To better the image-text integration, we propose a self-adaptive cross-attention fusion module that helps the image generation to attend more to important text information. Within such a fusion module, we use a self-adaptive learnable factor to learn to flexibly control the influence of cross-attention outputs on image generation. Experimental results show that our method outperforms the state-of-the-art in LTOS, text rendering, and layout-to-image tasks, enabling harmonious visual text rendering and object generation.

[43]  arXiv:2404.13584 [pdf, other]
Title: Rethink Arbitrary Style Transfer with Transformer and Contrastive Learning
Comments: Accepted by CVIU
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Arbitrary style transfer holds widespread attention in research and boasts numerous practical applications. The existing methods, which either employ cross-attention to incorporate deep style attributes into content attributes or use adaptive normalization to adjust content features, fail to generate high-quality stylized images. In this paper, we introduce an innovative technique to improve the quality of stylized images. Firstly, we propose Style Consistency Instance Normalization (SCIN), a method to refine the alignment between content and style features. In addition, we have developed an Instance-based Contrastive Learning (ICL) approach designed to understand the relationships among various styles, thereby enhancing the quality of the resulting stylized images. Recognizing that VGG networks are more adept at extracting classification features and need to be better suited for capturing style features, we have also introduced the Perception Encoder (PE) to capture style features. Extensive experiments demonstrate that our proposed method generates high-quality stylized images and effectively prevents artifacts compared with the existing state-of-the-art methods.

[44]  arXiv:2404.13591 [pdf, other]
Title: MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

While multi-modal large language models (MLLMs) have shown significant progress on many popular visual reasoning benchmarks, whether they possess abstract visual reasoning abilities remains an open question. Similar to the Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns (e.g., repetition constraints) that control the input shapes (e.g., digits) in a specific task configuration (e.g., matrix). However, existing AVR benchmarks only considered a limited set of patterns (addition, conjunction), input shapes (rectangle, square), and task configurations (3 by 3 matrices). To evaluate MLLMs' reasoning abilities comprehensively, we introduce MARVEL, a multidimensional AVR benchmark with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations. To inspect whether the model accuracy is grounded in perception and reasoning, MARVEL complements the general AVR question with perception questions in a hierarchical evaluation framework. We conduct comprehensive experiments on MARVEL with nine representative MLLMs in zero-shot and few-shot settings. Our experiments reveal that all models show near-random performance on the AVR question, with significant performance gaps (40%) compared to humans across all patterns and task configurations. Further analysis of perception questions reveals that MLLMs struggle to comprehend the visual features (near-random performance) and even count the panels in the puzzle ( <45%), hindering their ability for abstract reasoning. We release our entire code and dataset.

[45]  arXiv:2404.13594 [pdf, other]
Title: Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers
Comments: NAACL 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

An effective method for combining frozen large language models (LLM) and visual encoders involves a resampler module that creates a `visual prompt' which is provided to the LLM, along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks like image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use \textit{diagnostic classifiers} to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when kept frozen during training of the classifiers. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the resamplers can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability

[46]  arXiv:2404.13605 [pdf, other]
Title: Turb-Seg-Res: A Segment-then-Restore Pipeline for Dynamic Videos with Atmospheric Turbulence
Comments: CVPR 2024 Paper
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Tackling image degradation due to atmospheric turbulence, particularly in dynamic environment, remains a challenge for long-range imaging systems. Existing techniques have been primarily designed for static scenes or scenes with small motion. This paper presents the first segment-then-restore pipeline for restoring the videos of dynamic scenes in turbulent environment. We leverage mean optical flow with an unsupervised motion segmentation method to separate dynamic and static scene components prior to restoration. After camera shake compensation and segmentation, we introduce foreground/background enhancement leveraging the statistics of turbulence strength and a transformer model trained on a novel noise-based procedural turbulence generator for fast dataset augmentation. Benchmarked against existing restoration methods, our approach restores most of the geometric distortion and enhances sharpness for videos. We make our code, simulator, and data publicly available to advance the field of video restoration from turbulence: riponcs.github.io/TurbSegRes

[47]  arXiv:2404.13611 [pdf, other]
Title: Video sentence grounding with temporally global textual knowledge
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Temporal sentence grounding involves the retrieval of a video moment with a natural language query. Many existing works directly incorporate the given video and temporally localized query for temporal grounding, overlooking the inherent domain gap between different modalities. In this paper, we utilize pseudo-query features containing extensive temporally global textual knowledge sourced from the same video-query pair, to enhance the bridging of domain gaps and attain a heightened level of similarity between multi-modal features. Specifically, we propose a Pseudo-query Intermediary Network (PIN) to achieve an improved alignment of visual and comprehensive pseudo-query features within the feature space through contrastive learning. Subsequently, we utilize learnable prompts to encapsulate the knowledge of pseudo-queries, propagating them into the textual encoder and multi-modal fusion module, further enhancing the feature alignment between visual and language for better temporal grounding. Extensive experiments conducted on the Charades-STA and ActivityNet-Captions datasets demonstrate the effectiveness of our method.

[48]  arXiv:2404.13621 [pdf, other]
Title: Attack on Scene Flow using Point Clouds
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

Deep neural networks have made significant advancements in accurately estimating scene flow using point clouds, which is vital for many applications like video analysis, action recognition, and navigation. Robustness of these techniques, however, remains a concern, particularly in the face of adversarial attacks that have been proven to deceive state-of-the-art deep neural networks in many domains. Surprisingly, the robustness of scene flow networks against such attacks has not been thoroughly investigated. To address this problem, the proposed approach aims to bridge this gap by introducing adversarial white-box attacks specifically tailored for scene flow networks. Experimental results show that the generated adversarial examples obtain up to 33.7 relative degradation in average end-point error on the KITTI and FlyingThings3D datasets. The study also reveals the significant impact that attacks targeting point clouds in only one dimension or color channel have on average end-point error. Analyzing the success and failure of these attacks on the scene flow networks and their 2D optical flow network variants show a higher vulnerability for the optical flow networks.

[49]  arXiv:2404.13648 [pdf, other]
Title: Data-independent Module-aware Pruning for Hierarchical Vision Transformers
Comments: Accepted by ICLR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Hierarchical vision transformers (ViTs) have two advantages over conventional ViTs. First, hierarchical ViTs achieve linear computational complexity with respect to image size by local self-attention. Second, hierarchical ViTs create hierarchical feature maps by merging image patches in deeper layers for dense prediction. However, existing pruning methods ignore the unique properties of hierarchical ViTs and use the magnitude value as the weight importance. This approach leads to two main drawbacks. First, the "local" attention weights are compared at a "global" level, which may cause some "locally" important weights to be pruned due to their relatively small magnitude "globally". The second issue with magnitude pruning is that it fails to consider the distinct weight distributions of the network, which are essential for extracting coarse to fine-grained features at various hierarchical levels.
To solve the aforementioned issues, we have developed a Data-independent Module-Aware Pruning method (DIMAP) to compress hierarchical ViTs. To ensure that "local" attention weights at different hierarchical levels are compared fairly in terms of their contribution, we treat them as a module and examine their contribution by analyzing their information distortion. Furthermore, we introduce a novel weight metric that is solely based on weights and does not require input images, thereby eliminating the dependence on the patch merging process. Our method validates its usefulness and strengths on Swin Transformers of different sizes on ImageNet-1k classification. Notably, the top-5 accuracy drop is only 0.07% when we remove 52.5% FLOPs and 52.7% parameters of Swin-B. When we reduce 33.2% FLOPs and 33.2% parameters of Swin-S, we can even achieve a 0.8% higher relative top-5 accuracy than the original model. Code is available at: https://github.com/he-y/Data-independent-Module-Aware-Pruning

[50]  arXiv:2404.13657 [pdf, other]
Title: MLP: Motion Label Prior for Temporal Sentence Localization in Untrimmed 3D Human Motions
Comments: 13 pages, 9 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

In this paper, we address the unexplored question of temporal sentence localization in human motions (TSLM), aiming to locate a target moment from a 3D human motion that semantically corresponds to a text query. Considering that 3D human motions are captured using specialized motion capture devices, motions with only a few joints lack complex scene information like objects and lighting. Due to this character, motion data has low contextual richness and semantic ambiguity between frames, which limits the accuracy of predictions made by current video localization frameworks extended to TSLM to only a rough level. To refine this, we devise two novel label-prior-assisted training schemes: one embed prior knowledge of foreground and background to highlight the localization chances of target moments, and the other forces the originally rough predictions to overlap with the more accurate predictions obtained from the flipped start/end prior label sequences during recovery training. We show that injecting label-prior knowledge into the model is crucial for improving performance at high IoU. In our constructed TSLM benchmark, our model termed MLP achieves a recall of 44.13 at IoU@0.7 on the BABEL dataset and 71.17 on HumanML3D (Restore), outperforming prior works. Finally, we showcase the potential of our approach in corpus-level moment retrieval. Our source code is openly accessible at https://github.com/eanson023/mlp.

[51]  arXiv:2404.13659 [pdf, other]
Title: LMFNet: An Efficient Multimodal Fusion Approach for Semantic Segmentation in High-Resolution Remote Sensing
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Despite the rapid evolution of semantic segmentation for land cover classification in high-resolution remote sensing imagery, integrating multiple data modalities such as Digital Surface Model (DSM), RGB, and Near-infrared (NIR) remains a challenge. Current methods often process only two types of data, missing out on the rich information that additional modalities can provide. Addressing this gap, we propose a novel \textbf{L}ightweight \textbf{M}ultimodal data \textbf{F}usion \textbf{Net}work (LMFNet) to accomplish the tasks of fusion and semantic segmentation of multimodal remote sensing images. LMFNet uniquely accommodates various data types simultaneously, including RGB, NirRG, and DSM, through a weight-sharing, multi-branch vision transformer that minimizes parameter count while ensuring robust feature extraction. Our proposed multimodal fusion module integrates a \textit{Multimodal Feature Fusion Reconstruction Layer} and \textit{Multimodal Feature Self-Attention Fusion Layer}, which can reconstruct and fuse multimodal features. Extensive testing on public datasets such as US3D, ISPRS Potsdam, and ISPRS Vaihingen demonstrates the effectiveness of LMFNet. Specifically, it achieves a mean Intersection over Union ($mIoU$) of 85.09\% on the US3D dataset, marking a significant improvement over existing methods. Compared to unimodal approaches, LMFNet shows a 10\% enhancement in $mIoU$ with only a 0.5M increase in parameter count. Furthermore, against bimodal methods, our approach with trilateral inputs enhances $mIoU$ by 0.46 percentage points.

[52]  arXiv:2404.13667 [pdf, other]
Title: MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition
Comments: 12 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Printed mathematical expression recognition (MER) models are usually trained and tested using LaTeX-generated mathematical expressions (MEs) as input and the LaTeX source code as ground truth. As the same ME can be generated by various different LaTeX source codes, this leads to unwanted variations in the ground truth data that bias test performance results and hinder efficient learning. In addition, the use of only one font to generate the MEs heavily limits the generalization of the reported results to realistic scenarios. We propose a data-centric approach to overcome this problem, and present convincing experimental results: Our main contribution is an enhanced LaTeX normalization to map any LaTeX ME to a canonical form. Based on this process, we developed an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one. Second, we introduce the real-world dataset realFormula, with MEs extracted from papers. Third, we developed a MER model, MathNet, based on a convolutional vision transformer, with superior results on all four test sets (im2latex-100k, im2latexv2, realFormula, and InftyMDB-1), outperforming the previous state of the art by up to 88.3%.

[53]  arXiv:2404.13671 [pdf, other]
Title: FiLo: Zero-Shot Anomaly Detection by Fine-Grained Description and High-Quality Localization
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Zero-shot anomaly detection (ZSAD) methods entail detecting anomalies directly without access to any known normal or abnormal samples within the target item categories. Existing approaches typically rely on the robust generalization capabilities of multimodal pretrained models, computing similarities between manually crafted textual features representing "normal" or "abnormal" semantics and image features to detect anomalies and localize anomalous patches. However, the generic descriptions of "abnormal" often fail to precisely match diverse types of anomalies across different object categories. Additionally, computing feature similarities for single patches struggles to pinpoint specific locations of anomalies with various sizes and scales. To address these issues, we propose a novel ZSAD method called FiLo, comprising two components: adaptively learned Fine-Grained Description (FG-Des) and position-enhanced High-Quality Localization (HQ-Loc). FG-Des introduces fine-grained anomaly descriptions for each category using Large Language Models (LLMs) and employs adaptively learned textual templates to enhance the accuracy and interpretability of anomaly detection. HQ-Loc, utilizing Grounding DINO for preliminary localization, position-enhanced text prompts, and Multi-scale Multi-shape Cross-modal Interaction (MMCI) module, facilitates more accurate localization of anomalies of different sizes and shapes. Experimental results on datasets like MVTec and VisA demonstrate that FiLo significantly improves the performance of ZSAD in both detection and localization, achieving state-of-the-art performance with an image-level AUC of 83.9% and a pixel-level AUC of 95.9% on the VisA dataset.

[54]  arXiv:2404.13677 [pdf, other]
Title: A Dataset and Model for Realistic License Plate Deblurring
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Vehicle license plate recognition is a crucial task in intelligent traffic management systems. However, the challenge of achieving accurate recognition persists due to motion blur from fast-moving vehicles. Despite the widespread use of image synthesis approaches in existing deblurring and recognition algorithms, their effectiveness in real-world scenarios remains unproven. To address this, we introduce the first large-scale license plate deblurring dataset named License Plate Blur (LPBlur), captured by a dual-camera system and processed through a post-processing pipeline to avoid misalignment issues. Then, we propose a License Plate Deblurring Generative Adversarial Network (LPDGAN) to tackle the license plate deblurring: 1) a Feature Fusion Module to integrate multi-scale latent codes; 2) a Text Reconstruction Module to restore structure through textual modality; 3) a Partition Discriminator Module to enhance the model's perception of details in each letter. Extensive experiments validate the reliability of the LPBlur dataset for both model training and testing, showcasing that our proposed model outperforms other state-of-the-art motion deblurring methods in realistic license plate deblurring scenarios. The dataset and code are available at https://github.com/haoyGONG/LPDGAN.

[55]  arXiv:2404.13679 [pdf, other]
Title: GScream: Learning 3D Geometry and Feature Consistent Gaussian Splatting for Object Removal
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper tackles the intricate challenge of object removal to update the radiance field using the 3D Gaussian Splatting. The main challenges of this task lie in the preservation of geometric consistency and the maintenance of texture coherence in the presence of the substantial discrete nature of Gaussian primitives. We introduce a robust framework specifically designed to overcome these obstacles. The key insight of our approach is the enhancement of information exchange among visible and invisible areas, facilitating content restoration in terms of both geometry and texture. Our methodology begins with optimizing the positioning of Gaussian primitives to improve geometric consistency across both removed and visible areas, guided by an online registration process informed by monocular depth estimation. Following this, we employ a novel feature propagation mechanism to bolster texture coherence, leveraging a cross-attention design that bridges sampling Gaussians from both uncertain and certain areas. This innovative approach significantly refines the texture coherence within the final radiance field. Extensive experiments validate that our method not only elevates the quality of novel view synthesis for scenes undergoing object removal but also showcases notable efficiency gains in training and rendering speeds.

[56]  arXiv:2404.13680 [pdf, other]
Title: PoseAnimate: Zero-shot high fidelity pose controllable character animation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Image-to-video(I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity with the source image.However, existing approaches suffer from character appearance inconsistency and poor preservation of fine details. Moreover, they require a large amount of video data for training, which can be computationally demanding.To address these limitations,we propose PoseAnimate, a novel zero-shot I2V framework for character animation.PoseAnimate contains three key components: 1) Pose-Aware Control Module (PACM) incorporates diverse pose signals into conditional embeddings, to preserve character-independent content and maintain precise alignment of actions.2) Dual Consistency Attention Module (DCAM) enhances temporal consistency, and retains character identity and intricate background details.3) Mask-Guided Decoupling Module (MGDM) refines distinct feature perception, improving animation fidelity by decoupling the character and background.We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transition.Extensive experiment results demonstrate that our approach outperforms the state-of-the-art training-based methods in terms of character consistency and detail fidelity. Moreover, it maintains a high level of temporal coherence throughout the generated animations.

[57]  arXiv:2404.13686 [pdf, other]
Title: Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recently, a series of diffusion-aware distillation algorithms have emerged to alleviate the computational overhead associated with the multi-step inference process of Diffusion Models (DMs). Current distillation techniques often dichotomize into two distinct aspects: i) ODE Trajectory Preservation; and ii) ODE Trajectory Reformulation. However, these approaches suffer from severe performance degradation or domain shifts. To address these limitations, we propose Hyper-SD, a novel framework that synergistically amalgamates the advantages of ODE Trajectory Preservation and Reformulation, while maintaining near-lossless performance during step compression. Firstly, we introduce Trajectory Segmented Consistency Distillation to progressively perform consistent distillation within pre-defined time-step segments, which facilitates the preservation of the original ODE trajectory from a higher-order perspective. Secondly, we incorporate human feedback learning to boost the performance of the model in a low-step regime and mitigate the performance loss incurred by the distillation process. Thirdly, we integrate score distillation to further improve the low-step generation capability of the model and offer the first attempt to leverage a unified LoRA to support the inference process at all steps. Extensive experiments and user studies demonstrate that Hyper-SD achieves SOTA performance from 1 to 8 inference steps for both SDXL and SD1.5. For example, Hyper-SDXL surpasses SDXL-Lightning by +0.68 in CLIP Score and +0.51 in Aes Score in the 1-step inference.

[58]  arXiv:2404.13691 [pdf, other]
Title: A Complete System for Automated 3D Semantic-Geometric Mapping of Corrosion in Industrial Environments
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Corrosion, a naturally occurring process leading to the deterioration of metallic materials, demands diligent detection for quality control and the preservation of metal-based objects, especially within industrial contexts. Traditional techniques for corrosion identification, including ultrasonic testing, radio-graphic testing, and magnetic flux leakage, necessitate the deployment of expensive and bulky equipment on-site for effective data acquisition. An unexplored alternative involves employing lightweight, conventional camera systems, and state-of-the-art computer vision methods for its identification.
In this work, we propose a complete system for semi-automated corrosion identification and mapping in industrial environments. We leverage recent advances in LiDAR-based methods for localization and mapping, with vision-based semantic segmentation deep learning techniques, in order to build semantic-geometric maps of industrial environments. Unlike previous corrosion identification systems available in the literature, our designed multi-modal system is low-cost, portable, semi-autonomous and allows collecting large datasets by untrained personnel.
A set of experiments in an indoor laboratory environment, demonstrate quantitatively the high accuracy of the employed LiDAR based 3D mapping and localization system, with less then $0.05m$ and 0.02m average absolute and relative pose errors. Also, our data-driven semantic segmentation model, achieves around 70\% precision when trained with our pixel-wise manually annotated dataset.

[59]  arXiv:2404.13692 [pdf, other]
Title: A sustainable development perspective on urban-scale roof greening priorities and benefits
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Greenspaces are tightly linked to human well-being. Yet, rapid urbanization has exacerbated greenspace exposure inequality and declining human life quality. Roof greening has been recognized as an effective strategy to mitigate these negative impacts. Understanding priorities and benefits is crucial to promoting green roofs. Here, using geospatial big data, we conduct an urban-scale assessment of roof greening at a single building level in Hong Kong from a sustainable development perspective. We identify that 85.3\% of buildings reveal potential and urgent demand for roof greening. We further find green roofs could increase greenspace exposure by \textasciitilde61\% and produce hundreds of millions (HK\$) in economic benefits annually but play a small role in urban heat mitigation (\textasciitilde0.15\degree{C}) and annual carbon emission offsets (\textasciitilde0.8\%). Our study offers a comprehensive assessment of roof greening, which could provide reference for sustainable development in cities worldwide, from data utilization to solutions and findings.

[60]  arXiv:2404.13701 [pdf, other]
Title: Semantic-Rearrangement-Based Multi-Level Alignment for Domain Generalized Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Domain generalized semantic segmentation is an essential computer vision task, for which models only leverage source data to learn the capability of generalized semantic segmentation towards the unseen target domains. Previous works typically address this challenge by global style randomization or feature regularization. In this paper, we argue that given the observation that different local semantic regions perform different visual characteristics from the source domain to the target domain, methods focusing on global operations are hard to capture such regional discrepancies, thus failing to construct domain-invariant representations with the consistency from local to global level. Therefore, we propose the Semantic-Rearrangement-based Multi-Level Alignment (SRMA) to overcome this problem. SRMA first incorporates a Semantic Rearrangement Module (SRM), which conducts semantic region randomization to enhance the diversity of the source domain sufficiently. A Multi-Level Alignment module (MLA) is subsequently proposed with the help of such diversity to establish the global-regional-local consistent domain-invariant representations. By aligning features across randomized samples with domain-neutral knowledge at multiple levels, SRMA provides a more robust way to handle the source-target domain gap. Extensive experiments demonstrate the superiority of SRMA over the current state-of-the-art works on various benchmarks.

[61]  arXiv:2404.13706 [pdf, other]
Title: Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Motivated by ethical and legal concerns, the scientific community is actively developing methods to limit the misuse of Text-to-Image diffusion models for reproducing copyrighted, violent, explicit, or personal information in the generated images. Simultaneously, researchers put these newly developed safety measures to the test by assuming the role of an adversary to find vulnerabilities and backdoors in them. We use compositional property of diffusion models, which allows to leverage multiple prompts in a single image generation. This property allows us to combine other concepts, that should not have been affected by the inhibition, to reconstruct the vector, responsible for target concept generation, even though the direct computation of this vector is no longer accessible. We provide theoretical and empirical evidence why the proposed attacks are possible and discuss the implications of these findings for safe model deployment. We argue that it is essential to consider all possible approaches to image generation with diffusion models that can be employed by an adversary. Our work opens up the discussion about the implications of concept arithmetics and compositional inference for safety mechanisms in diffusion models.
Content Advisory: This paper contains discussions and model-generated content that may be considered offensive. Reader discretion is advised.
Project page: https://cs-people.bu.edu/vpetsiuk/arc

[62]  arXiv:2404.13710 [pdf, other]
Title: SVGEditBench: A Benchmark Dataset for Quantitative Assessment of LLM's SVG Editing Capabilities
Comments: Accepted to Workshop on Graphic Design Understanding and Generation (GDUG), a CVPR2024 workshop. Dataset: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Text-to-image models have shown progress in recent years. Along with this progress, generating vector graphics from text has also advanced. SVG is a popular format for vector graphics, and SVG represents a scene with XML text. Therefore, Large Language Models can directly process SVG code. Taking this into account, we focused on editing SVG with LLMs. For quantitative evaluation of LLMs' ability to edit SVG, we propose SVGEditBench. SVGEditBench is a benchmark for assessing the LLMs' ability to edit SVG code. We also show the GPT-4 and GPT-3.5 results when evaluated on the proposed benchmark. In the experiments, GPT-4 showed superior performance to GPT-3.5 both quantitatively and qualitatively. The dataset is available at https://github.com/mti-lab/SVGEditBench.

[63]  arXiv:2404.13711 [pdf, other]
Title: ArtNeRF: A Stylized Neural Field for 3D-Aware Cartoonized Face Synthesis
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent advances in generative visual models and neural radiance fields have greatly boosted 3D-aware image synthesis and stylization tasks. However, previous NeRF-based work is limited to single scene stylization, training a model to generate 3D-aware cartoon faces with arbitrary styles remains unsolved. We propose ArtNeRF, a novel face stylization framework derived from 3D-aware GAN to tackle this problem. In this framework, we utilize an expressive generator to synthesize stylized faces and a triple-branch discriminator module to improve the visual quality and style consistency of the generated faces. Specifically, a style encoder based on contrastive learning is leveraged to extract robust low-dimensional embeddings of style images, empowering the generator with the knowledge of various styles. To smooth the training process of cross-domain transfer learning, we propose an adaptive style blending module which helps inject style information and allows users to freely tune the level of stylization. We further introduce a neural rendering module to achieve efficient real-time rendering of images with higher resolutions. Extensive experiments demonstrate that ArtNeRF is versatile in generating high-quality 3D-aware cartoon faces with arbitrary styles.

[64]  arXiv:2404.13745 [pdf, other]
Title: A Nasal Cytology Dataset for Object Detection and Deep Learning
Comments: Pre Print almost ready to be submitted
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Nasal Cytology is a new and efficient clinical technique to diagnose rhinitis and allergies that is not much widespread due to the time-consuming nature of cell counting; that is why AI-aided counting could be a turning point for the diffusion of this technique. In this article we present the first dataset of rhino-cytological field images: the NCD (Nasal Cytology Dataset), aimed to train and deploy Object Detection models to support physicians and biologists during clinical practice. The real distribution of the cytotypes, populating the nasal mucosa has been replicated, sampling images from slides of clinical patients, and manually annotating each cell found on them. The correspondent object detection task presents non'trivial issues associated with the strong class imbalancement, involving the rarest cell types. This work contributes to some of open challenges by presenting a novel machine learning-based approach to aid the automated detection and classification of nasal mucosa cells: the DETR and YOLO models shown good performance in detecting cells and classifying them correctly, revealing great potential to accelerate the work of rhinology experts.

[65]  arXiv:2404.13766 [pdf, other]
Title: Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Current diffusion models create photorealistic images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image. This is evidenced by our novel image-graph alignment model called EPViT (Edge Prediction Vision Transformer) for the evaluation of image-text alignment. To alleviate the above problem, we propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence. Additionally, the syntax structure of the prompt helps to disentangle the multimodal CLIP embeddings that are commonly used in T2I generation. The resulting DisCLIP embeddings and FCA are easily integrated in state-of-the-art diffusion models without additional training of these models. We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.\footnote{Code and data will be made available upon acceptance.

[66]  arXiv:2404.13770 [pdf, other]
Title: EncodeNet: A Framework for Boosting DNN Accuracy with Entropy-driven Generalized Converting Autoencoder
Comments: 15 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Image classification is a fundamental task in computer vision, and the quest to enhance DNN accuracy without inflating model size or latency remains a pressing concern. We make a couple of advances in this regard, leading to a novel EncodeNet design and training framework. The first advancement involves Converting Autoencoders, a novel approach that transforms images into an easy-to-classify image of its class. Our prior work that applied the Converting Autoencoder and a simple classifier in tandem achieved moderate accuracy over simple datasets, such as MNIST and FMNIST. However, on more complex datasets like CIFAR-10, the Converting Autoencoder has a large reconstruction loss, making it unsuitable for enhancing DNN accuracy. To address these limitations, we generalize the design of Converting Autoencoders by leveraging a larger class of DNNs, those with architectures comprising feature extraction layers followed by classification layers. We incorporate a generalized algorithmic design of the Converting Autoencoder and intraclass clustering to identify representative images, leading to optimized image feature learning. Next, we demonstrate the effectiveness of our EncodeNet design and training framework, improving the accuracy of well-trained baseline DNNs while maintaining the overall model size. EncodeNet's building blocks comprise the trained encoder from our generalized Converting Autoencoders transferring knowledge to a lightweight classifier network - also extracted from the baseline DNN. Our experimental results demonstrate that EncodeNet improves the accuracy of VGG16 from 92.64% to 94.05% on CIFAR-10 and RestNet20 from 74.56% to 76.04% on CIFAR-100. It outperforms state-of-the-art techniques that rely on knowledge distillation and attention mechanisms, delivering higher accuracy for models of comparable size.

[67]  arXiv:2404.13788 [pdf, other]
Title: AnyPattern: Towards In-context Image Copy Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

This paper explores in-context learning for image copy detection (ICD), i.e., prompting an ICD model to identify replicated images with new tampering patterns without the need for additional training. The prompts (or the contexts) are from a small set of image-replica pairs that reflect the new patterns and are used at inference time. Such in-context ICD has good realistic value, because it requires no fine-tuning and thus facilitates fast reaction against the emergence of unseen patterns. To accommodate the "seen $\rightarrow$ unseen" generalization scenario, we construct the first large-scale pattern dataset named AnyPattern, which has the largest number of tamper patterns ($90$ for training and $10$ for testing) among all the existing ones. We benchmark AnyPattern with popular ICD methods and reveal that existing methods barely generalize to novel tamper patterns. We further propose a simple in-context ICD method named ImageStacker. ImageStacker learns to select the most representative image-replica pairs and employs them as the pattern prompts in a stacking manner (rather than the popular concatenation manner). Experimental results show (1) training with our large-scale dataset substantially benefits pattern generalization ($+26.66 \%$ $\mu AP$), (2) the proposed ImageStacker facilitates effective in-context ICD (another round of $+16.75 \%$ $\mu AP$), and (3) AnyPattern enables in-context ICD, i.e. without such a large-scale dataset, in-context learning does not emerge even with our ImageStacker. The project (including the proposed dataset AnyPattern and the code for ImageStacker) is publicly available at https://anypattern.github.io under the MIT Licence.

[68]  arXiv:2404.13791 [pdf, other]
Title: Universal Fingerprint Generation: Controllable Diffusion Model with Multimodal Conditions
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

The utilization of synthetic data for fingerprint recognition has garnered increased attention due to its potential to alleviate privacy concerns surrounding sensitive biometric data. However, current methods for generating fingerprints have limitations in creating impressions of the same finger with useful intra-class variations. To tackle this challenge, we present GenPrint, a framework to produce fingerprint images of various types while maintaining identity and offering humanly understandable control over different appearance factors such as fingerprint class, acquisition type, sensor device, and quality level. Unlike previous fingerprint generation approaches, GenPrint is not confined to replicating style characteristics from the training dataset alone: it enables the generation of novel styles from unseen devices without requiring additional fine-tuning. To accomplish these objectives, we developed GenPrint using latent diffusion models with multimodal conditions (text and image) for consistent generation of style and identity. Our experiments leverage a variety of publicly available datasets for training and evaluation. Results demonstrate the benefits of GenPrint in terms of identity preservation, explainable control, and universality of generated images. Importantly, the GenPrint-generated images yield comparable or even superior accuracy to models trained solely on real data and further enhances performance when augmenting the diversity of existing real fingerprint datasets.

[69]  arXiv:2404.13798 [pdf, other]
Title: Enforcing Conditional Independence for Fair Representation Learning and Causal Image Generation
Comments: To appear at the 2024 IEEE CVPR Workshop on Fair, Data-Efficient, and Trusted Computer Vision
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Conditional independence (CI) constraints are critical for defining and evaluating fairness in machine learning, as well as for learning unconfounded or causal representations. Traditional methods for ensuring fairness either blindly learn invariant features with respect to a protected variable (e.g., race when classifying sex from face images) or enforce CI relative to the protected attribute only on the model output (e.g., the sex label). Neither of these methods are effective in enforcing CI in high-dimensional feature spaces. In this paper, we focus on a nascent approach characterizing the CI constraint in terms of two Jensen-Shannon divergence terms, and we extend it to high-dimensional feature spaces using a novel dynamic sampling strategy. In doing so, we introduce a new training paradigm that can be applied to any encoder architecture. We are able to enforce conditional independence of the diffusion autoencoder latent representation with respect to any protected attribute under the equalized odds constraint and show that this approach enables causal image generation with controllable latent spaces. Our experimental results demonstrate that our approach can achieve high accuracy on downstream tasks while upholding equality of odds.

[70]  arXiv:2404.13807 [pdf, other]
Title: FaceFolds: Meshed Radiance Manifolds for Efficient Volumetric Rendering of Dynamic Faces
Comments: In Proceedings of the ACM in Computer Graphics and Interactive Techniques, 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

3D rendering of dynamic face captures is a challenging problem, and it demands improvements on several fronts$\unicode{x2014}$photorealism, efficiency, compatibility, and configurability. We present a novel representation that enables high-quality volumetric rendering of an actor's dynamic facial performances with minimal compute and memory footprint. It runs natively on commodity graphics soft- and hardware, and allows for a graceful trade-off between quality and efficiency. Our method utilizes recent advances in neural rendering, particularly learning discrete radiance manifolds to sparsely sample the scene to model volumetric effects. We achieve efficient modeling by learning a single set of manifolds for the entire dynamic sequence, while implicitly modeling appearance changes as temporal canonical texture. We export a single layered mesh and view-independent RGBA texture video that is compatible with legacy graphics renderers without additional ML integration. We demonstrate our method by rendering dynamic face captures of real actors in a game engine, at comparable photorealism to state-of-the-art neural rendering techniques at previously unseen frame rates.

[71]  arXiv:2404.13816 [pdf, other]
Title: Neural Radiance Field in Autonomous Driving: A Survey
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Neural Radiance Field (NeRF) has garnered significant attention from both academia and industry due to its intrinsic advantages, particularly its implicit representation and novel view synthesis capabilities. With the rapid advancements in deep learning, a multitude of methods have emerged to explore the potential applications of NeRF in the domain of Autonomous Driving (AD). However, a conspicuous void is apparent within the current literature. To bridge this gap, this paper conducts a comprehensive survey of NeRF's applications in the context of AD. Our survey is structured to categorize NeRF's applications in Autonomous Driving (AD), specifically encompassing perception, 3D reconstruction, simultaneous localization and mapping (SLAM), and simulation. We delve into in-depth analysis and summarize the findings for each application category, and conclude by providing insights and discussions on future directions in this field. We hope this paper serves as a comprehensive reference for researchers in this domain. To the best of our knowledge, this is the first survey specifically focused on the applications of NeRF in the Autonomous Driving domain.

[72]  arXiv:2404.13819 [pdf, other]
Title: HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We address the challenging task of identifying, segmenting, and tracking hand-held objects, which is crucial for applications such as human action segmentation and performance evaluation. This task is particularly challenging due to heavy occlusion, rapid motion, and the transitory nature of objects being hand-held, where an object may be held, released, and subsequently picked up again. To tackle these challenges, we have developed a novel transformer-based architecture called HOIST-Former. HOIST-Former is adept at spatially and temporally segmenting hands and objects by iteratively pooling features from each other, ensuring that the processes of identification, segmentation, and tracking of hand-held objects depend on the hands' positions and their contextual appearance. We further refine HOIST-Former with a contact loss that focuses on areas where hands are in contact with objects. Moreover, we also contribute an in-the-wild video dataset called HOIST, which comprises 4,125 videos complete with bounding boxes, segmentation masks, and tracking IDs for hand-held objects. Through experiments on the HOIST dataset and two additional public datasets, we demonstrate the efficacy of HOIST-Former in segmenting and tracking hand-held objects.

[73]  arXiv:2404.13827 [pdf, other]
Title: Swap It Like Its Hot: Segmentation-based spoof attacks on eye-tracking images
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Video-based eye trackers capture the iris biometric and enable authentication to secure user identity. However, biometric authentication is susceptible to spoofing another user's identity through physical or digital manipulation. The current standard to identify physical spoofing attacks on eye-tracking sensors uses liveness detection. Liveness detection classifies gaze data as real or fake, which is sufficient to detect physical presentation attacks. However, such defenses cannot detect a spoofing attack when real eye image inputs are digitally manipulated to swap the iris pattern of another person. We propose IrisSwap as a novel attack on gaze-based liveness detection. IrisSwap allows attackers to segment and digitally swap in a victim's iris pattern to fool iris authentication. Both offline and online attacks produce gaze data that deceives the current state-of-the-art defense models at rates up to 58% and motivates the need to develop more advanced authentication methods for eye trackers.

[74]  arXiv:2404.13830 [pdf, other]
Title: A Comprehensive Survey and Taxonomy on Point Cloud Registration Based on Deep Learning
Comments: This paper is accepted by IJCAI 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Point cloud registration (PCR) involves determining a rigid transformation that aligns one point cloud to another. Despite the plethora of outstanding deep learning (DL)-based registration methods proposed, comprehensive and systematic studies on DL-based PCR techniques are still lacking. In this paper, we present a comprehensive survey and taxonomy of recently proposed PCR methods. Firstly, we conduct a taxonomy of commonly utilized datasets and evaluation metrics. Secondly, we classify the existing research into two main categories: supervised and unsupervised registration, providing insights into the core concepts of various influential PCR models. Finally, we highlight open challenges and potential directions for future research. A curated collection of valuable resources is made available at https://github.com/yxzhang15/PCR.

[75]  arXiv:2404.13838 [pdf, ps, other]
Title: C2F-SemiCD: A Coarse-to-Fine Semi-Supervised Change Detection Method Based on Consistency Regularization in High-Resolution Remote Sensing Images
Subjects: Computer Vision and Pattern Recognition (cs.CV)

A high-precision feature extraction model is crucial for change detection (CD). In the past, many deep learning-based supervised CD methods learned to recognize change feature patterns from a large number of labelled bi-temporal images, whereas labelling bi-temporal remote sensing images is very expensive and often time-consuming; therefore, we propose a coarse-to-fine semi-supervised CD method based on consistency regularization (C2F-SemiCD), which includes a coarse-to-fine CD network with a multiscale attention mechanism (C2FNet) and a semi-supervised update method. Among them, the C2FNet network gradually completes the extraction of change features from coarse-grained to fine-grained through multiscale feature fusion, channel attention mechanism, spatial attention mechanism, global context module, feature refine module, initial aggregation module, and final aggregation module. The semi-supervised update method uses the mean teacher method. The parameters of the student model are updated to the parameters of the teacher Model by using the exponential moving average (EMA) method. Through extensive experiments on three datasets and meticulous ablation studies, including crossover experiments across datasets, we verify the significant effectiveness and efficiency of the proposed C2F-SemiCD method. The code will be open at: https://github.com/ChengxiHAN/C2F-SemiCDand-C2FNet.

[76]  arXiv:2404.13842 [pdf, other]
Title: On Support Relations Inference and Scene Hierarchy Graph Construction from Point Cloud in Clustered Environments
Authors: Gang Ma, Hui Wei
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)

Over the years, scene understanding has attracted a growing interest in computer vision, providing the semantic and physical scene information necessary for robots to complete some particular tasks autonomously. In 3D scenes, rich spatial geometric and topological information are often ignored by RGB-based approaches for scene understanding. In this study, we develop a bottom-up approach for scene understanding that infers support relations between objects from a point cloud. Our approach utilizes the spatial topology information of the plane pairs in the scene, consisting of three major steps. 1) Detection of pairwise spatial configuration: dividing primitive pairs into local support connection and local inner connection; 2) primitive classification: a combinatorial optimization method applied to classify primitives; and 3) support relations inference and hierarchy graph construction: bottom-up support relations inference and scene hierarchy graph construction containing primitive level and object level. Through experiments, we demonstrate that the algorithm achieves excellent performance in primitive classification and support relations inference. Additionally, we show that the scene hierarchy graph contains rich geometric and topological information of objects, and it possesses great scalability for scene understanding.

[77]  arXiv:2404.13847 [pdf, other]
Title: EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Visual Commonsense Reasoning (VCR) is a cognitive task, challenging models to answer visual questions requiring human commonsense, and to provide rationales explaining why the answers are correct. With emergence of Large Language Models (LLMs), it is natural and imperative to explore their applicability to VCR. However, VCR task demands more external knowledge to tackle its challenging questions, necessitating special designs to activate LLMs' commonsense reasoning abilities. Also, most existing Multimodal LLMs adopted an abstraction of entire input image, which makes it difficult to comprehend VCR's unique co-reference tags between image regions and text, posing challenges for fine-grained alignment. To address these issues, we propose EventLens that leverages Event-Aware Pretraining and Cross-modal Linking and EnhanceS VCR. First, by emulating the cognitive process of human reasoning, an Event-Aware Pretraining auxiliary task is introduced to better activate LLM's global comprehension of intricate scenarios. Second, during fine-tuning, we further utilize reference tags to bridge RoI features with texts, while preserving both modality semantics. Finally, we use instruct-style prompts to narrow the gap between pretraining and fine-tuning, and task-specific adapters to better integrate LLM's inherent knowledge with new commonsense. Experimental results show the effectiveness of our proposed auxiliary task and fine-grained linking strategy.

[78]  arXiv:2404.13848 [pdf, other]
Title: DSDRNet: Disentangling Representation and Reconstruct Network for Domain Generalization
Comments: This paper is accepted to IJCNN 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Domain generalization faces challenges due to the distribution shift between training and testing sets, and the presence of unseen target domains. Common solutions include domain alignment, meta-learning, data augmentation, or ensemble learning, all of which rely on domain labels or domain adversarial techniques. In this paper, we propose a Dual-Stream Separation and Reconstruction Network, dubbed DSDRNet. It is a disentanglement-reconstruction approach that integrates features of both inter-instance and intra-instance through dual-stream fusion. The method introduces novel supervised signals by combining inter-instance semantic distance and intra-instance similarity. Incorporating Adaptive Instance Normalization (AdaIN) into a two-stage cyclic reconstruction process enhances self-disentangled reconstruction signals to facilitate model convergence. Extensive experiments on four benchmark datasets demonstrate that DSDRNet outperforms other popular methods in terms of domain generalization capabilities.

[79]  arXiv:2404.13854 [pdf, other]
Title: Self-Supervised Monocular Depth Estimation in the Dark: Towards Data Distribution Compensation
Comments: Accepted by IJCAI2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Nighttime self-supervised monocular depth estimation has received increasing attention in recent years. However, using night images for self-supervision is unreliable because the photometric consistency assumption is usually violated in the videos taken under complex lighting conditions. Even with domain adaptation or photometric loss repair, performance is still limited by the poor supervision of night images on trainable networks. In this paper, we propose a self-supervised nighttime monocular depth estimation method that does not use any night images during training. Our framework utilizes day images as a stable source for self-supervision and applies physical priors (e.g., wave optics, reflection model and read-shot noise model) to compensate for some key day-night differences. With day-to-night data distribution compensation, our framework can be trained in an efficient one-stage self-supervised manner. Though no nighttime images are considered during training, qualitative and quantitative results demonstrate that our method achieves SoTA depth estimating results on the challenging nuScenes-Night and RobotCar-Night compared with existing methods.

[80]  arXiv:2404.13859 [pdf, other]
Title: Unveiling and Mitigating Generalized Biases of DNNs through the Intrinsic Dimensions of Perceptual Manifolds
Comments: 8pages, 6figures, Submitted to TPAMI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Building fair deep neural networks (DNNs) is a crucial step towards achieving trustworthy artificial intelligence. Delving into deeper factors that affect the fairness of DNNs is paramount and serves as the foundation for mitigating model biases. However, current methods are limited in accurately predicting DNN biases, relying solely on the number of training samples and lacking more precise measurement tools. Here, we establish a geometric perspective for analyzing the fairness of DNNs, comprehensively exploring how DNNs internally shape the intrinsic geometric characteristics of datasets-the intrinsic dimensions (IDs) of perceptual manifolds, and the impact of IDs on the fairness of DNNs. Based on multiple findings, we propose Intrinsic Dimension Regularization (IDR), which enhances the fairness and performance of models by promoting the learning of concise and ID-balanced class perceptual manifolds. In various image recognition benchmark tests, IDR significantly mitigates model bias while improving its performance.

[81]  arXiv:2404.13862 [pdf, other]
Title: PGAHum: Prior-Guided Geometry and Appearance Learning for High-Fidelity Animatable Human Reconstruction
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent techniques on implicit geometry representation learning and neural rendering have shown promising results for 3D clothed human reconstruction from sparse video inputs. However, it is still challenging to reconstruct detailed surface geometry and even more difficult to synthesize photorealistic novel views with animated human poses. In this work, we introduce PGAHum, a prior-guided geometry and appearance learning framework for high-fidelity animatable human reconstruction. We thoroughly exploit 3D human priors in three key modules of PGAHum to achieve high-quality geometry reconstruction with intricate details and photorealistic view synthesis on unseen poses. First, a prior-based implicit geometry representation of 3D human, which contains a delta SDF predicted by a tri-plane network and a base SDF derived from the prior SMPL model, is proposed to model the surface details and the body shape in a disentangled manner. Second, we introduce a novel prior-guided sampling strategy that fully leverages the prior information of the human pose and body to sample the query points within or near the body surface. By avoiding unnecessary learning in the empty 3D space, the neural rendering can recover more appearance details. Last, we propose a novel iterative backward deformation strategy to progressively find the correspondence for the query point in observation space. A skinning weights prediction model is learned based on the prior provided by the SMPL model to achieve the iterative backward LBS deformation. Extensive quantitative and qualitative comparisons on various datasets are conducted and the results demonstrate the superiority of our framework. Ablation studies also verify the effectiveness of each scheme for geometry and appearance learning.

[82]  arXiv:2404.13863 [pdf, other]
Title: PM-VIS: High-Performance Box-Supervised Video Instance Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Labeling pixel-wise object masks in videos is a resource-intensive and laborious process. Box-supervised Video Instance Segmentation (VIS) methods have emerged as a viable solution to mitigate the labor-intensive annotation process. . In practical applications, the two-step approach is not only more flexible but also exhibits a higher recognition accuracy. Inspired by the recent success of Segment Anything Model (SAM), we introduce a novel approach that aims at harnessing instance box annotations from multiple perspectives to generate high-quality instance pseudo masks, thus enriching the information contained in instance annotations. We leverage ground-truth boxes to create three types of pseudo masks using the HQ-SAM model, the box-supervised VIS model (IDOL-BoxInst), and the VOS model (DeAOT) separately, along with three corresponding optimization mechanisms. Additionally, we introduce two ground-truth data filtering methods, assisted by high-quality pseudo masks, to further enhance the training dataset quality and improve the performance of fully supervised VIS methods. To fully capitalize on the obtained high-quality Pseudo Masks, we introduce a novel algorithm, PM-VIS, to integrate mask losses into IDOL-BoxInst. Our PM-VIS model, trained with high-quality pseudo mask annotations, demonstrates strong ability in instance mask prediction, achieving state-of-the-art performance on the YouTube-VIS 2019, YouTube-VIS 2021, and OVIS validation sets, notably narrowing the gap between box-supervised and fully supervised VIS methods.

[83]  arXiv:2404.13866 [pdf, other]
Title: Plug-and-Play Algorithm Convergence Analysis From The Standpoint of Stochastic Differential Equation
Comments: 17pages, Preprint, Under review
Subjects: Computer Vision and Pattern Recognition (cs.CV); Probability (math.PR)

The Plug-and-Play (PnP) algorithm is popular for inverse image problem-solving. However, this algorithm lacks theoretical analysis of its convergence with more advanced plug-in denoisers. We demonstrate that discrete PnP iteration can be described by a continuous stochastic differential equation (SDE). We can also achieve this transformation through Markov process formulation of PnP. Then, we can take a higher standpoint of PnP algorithms from stochastic differential equations, and give a unified framework for the convergence property of PnP according to the solvability condition of its corresponding SDE. We reveal that a much weaker condition, bounded denoiser with Lipschitz continuous measurement function would be enough for its convergence guarantee, instead of previous Lipschitz continuous denoiser condition.

[84]  arXiv:2404.13868 [pdf, other]
Title: TeamTrack: A Dataset for Multi-Sport Multi-Object Tracking in Full-pitch Videos
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Multi-object tracking (MOT) is a critical and challenging task in computer vision, particularly in situations involving objects with similar appearances but diverse movements, as seen in team sports. Current methods, largely reliant on object detection and appearance, often fail to track targets in such complex scenarios accurately. This limitation is further exacerbated by the lack of comprehensive and diverse datasets covering the full view of sports pitches. Addressing these issues, we introduce TeamTrack, a pioneering benchmark dataset specifically designed for MOT in sports. TeamTrack is an extensive collection of full-pitch video data from various sports, including soccer, basketball, and handball. Furthermore, we perform a comprehensive analysis and benchmarking effort to underscore TeamTrack's utility and potential impact. Our work signifies a crucial step forward, promising to elevate the precision and effectiveness of MOT in complex, dynamic settings such as team sports. The dataset, project code and competition is released at: https://atomscott.github.io/TeamTrack/.

[85]  arXiv:2404.13872 [pdf, other]
Title: FreqBlender: Enhancing DeepFake Detection by Blending Frequency Knowledge
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Generating synthetic fake faces, known as pseudo-fake faces, is an effective way to improve the generalization of DeepFake detection. Existing methods typically generate these faces by blending real or fake faces in color space. While these methods have shown promise, they overlook the simulation of frequency distribution in pseudo-fake faces, limiting the learning of generic forgery traces in-depth. To address this, this paper introduces {\em FreqBlender}, a new method that can generate pseudo-fake faces by blending frequency knowledge. Specifically, we investigate the major frequency components and propose a Frequency Parsing Network to adaptively partition frequency components related to forgery traces. Then we blend this frequency knowledge from fake faces into real faces to generate pseudo-fake faces. Since there is no ground truth for frequency components, we describe a dedicated training strategy by leveraging the inner correlations among different frequency knowledge to instruct the learning process. Experimental results demonstrate the effectiveness of our method in enhancing DeepFake detection, making it a potential plug-and-play strategy for other methods.

[86]  arXiv:2404.13873 [pdf, other]
Title: Texture-aware and Shape-guided Transformer for Sequential DeepFake Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Sequential DeepFake detection is an emerging task that aims to predict the manipulation sequence in order. Existing methods typically formulate it as an image-to-sequence problem, employing conventional Transformer architectures for detection. However, these methods lack dedicated design and consequently result in limited performance. In this paper, we propose a novel Texture-aware and Shape-guided Transformer to enhance detection performance. Our method features four major improvements. Firstly, we describe a texture-aware branch that effectively captures subtle manipulation traces with the Diversiform Pixel Difference Attention module. Then we introduce a Bidirectional Interaction Cross-attention module that seeks deep correlations among spatial and sequential features, enabling effective modeling of complex manipulation traces. To further enhance the cross-attention, we describe a Shape-guided Gaussian mapping strategy, providing initial priors of the manipulation shape. Finally, observing that the latter manipulation in a sequence may influence traces left in the earlier one, we intriguingly invert the prediction order from forward to backward, leading to notable gains as expected. Extensive experimental results demonstrate that our method outperforms others by a large margin, highlighting the superiority of our method.

[87]  arXiv:2404.13880 [pdf, other]
Title: Regional Style and Color Transfer
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper presents a novel contribution to the field of regional style transfer. Existing methods often suffer from the drawback of applying style homogeneously across the entire image, leading to stylistic inconsistencies or foreground object twisted when applied to image with foreground elements such as person figures. To address this limitation, we propose a new approach that leverages a segmentation network to precisely isolate foreground objects within the input image. Subsequently, style transfer is applied exclusively to the background region. The isolated foreground objects are then carefully reintegrated into the style-transferred background. To enhance the visual coherence between foreground and background, a color transfer step is employed on the foreground elements prior to their rein-corporation. Finally, we utilize feathering techniques to achieve a seamless amalgamation of foreground and background, resulting in a visually unified and aesthetically pleasing final composition. Extensive evaluations demonstrate that our proposed approach yields significantly more natural stylistic transformations compared to conventional methods.

[88]  arXiv:2404.13896 [pdf, other]
Title: CT-NeRF: Incremental Optimizing Neural Radiance Field and Poses with Complex Trajectory
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Neural radiance field (NeRF) has achieved impressive results in high-quality 3D scene reconstruction. However, NeRF heavily relies on precise camera poses. While recent works like BARF have introduced camera pose optimization within NeRF, their applicability is limited to simple trajectory scenes. Existing methods struggle while tackling complex trajectories involving large rotations. To address this limitation, we propose CT-NeRF, an incremental reconstruction optimization pipeline using only RGB images without pose and depth input. In this pipeline, we first propose a local-global bundle adjustment under a pose graph connecting neighboring frames to enforce the consistency between poses to escape the local minima caused by only pose consistency with the scene structure. Further, we instantiate the consistency between poses as a reprojected geometric image distance constraint resulting from pixel-level correspondences between input image pairs. Through the incremental reconstruction, CT-NeRF enables the recovery of both camera poses and scene structure and is capable of handling scenes with complex trajectories. We evaluate the performance of CT-NeRF on two real-world datasets, NeRFBuster and Free-Dataset, which feature complex trajectories. Results show CT-NeRF outperforms existing methods in novel view synthesis and pose estimation accuracy.

[89]  arXiv:2404.13903 [pdf, other]
Title: Accelerating Image Generation with Sub-path Linear Approximation Model
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Diffusion models have significantly advanced the state of the art in image, audio, and video generation tasks. However, their applications in practical scenarios are hindered by slow inference speed. Drawing inspiration from the approximation strategies utilized in consistency models, we propose the Sub-path Linear Approximation Model (SLAM), which accelerates diffusion models while maintaining high-quality image generation. SLAM treats the PF-ODE trajectory as a series of PF-ODE sub-paths divided by sampled points, and harnesses sub-path linear (SL) ODEs to form a progressive and continuous error estimation along each individual PF-ODE sub-path. The optimization on such SL-ODEs allows SLAM to construct denoising mappings with smaller cumulative approximated errors. An efficient distillation method is also developed to facilitate the incorporation of more advanced diffusion models, such as latent diffusion models. Our extensive experimental results demonstrate that SLAM achieves an efficient training regimen, requiring only 6 A100 GPU days to produce a high-quality generative model capable of 2 to 4-step generation with high performance. Comprehensive evaluations on LAION, MS COCO 2014, and MS COCO 2017 datasets also illustrate that SLAM surpasses existing acceleration methods in few-step generation tasks, achieving state-of-the-art performance both on FID and the quality of the generated images.

[90]  arXiv:2404.13911 [pdf, ps, other]
Title: Global OpenBuildingMap -- Unveiling the Mystery of Global Buildings
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Understanding how buildings are distributed globally is crucial to revealing the human footprint on our home planet. This built environment affects local climate, land surface albedo, resource distribution, and many other key factors that influence well-being and human health. Despite this, quantitative and comprehensive data on the distribution and properties of buildings worldwide is lacking. To this end, by using a big data analytics approach and nearly 800,000 satellite images, we generated the highest resolution and highest accuracy building map ever created: the Global OpenBuildingMap (Global OBM). A joint analysis of building maps and solar potentials indicates that rooftop solar energy can supply the global energy consumption need at a reasonable cost. Specifically, if solar panels were placed on the roofs of all buildings, they could supply 1.1-3.3 times -- depending on the efficiency of the solar device -- the global energy consumption in 2020, which is the year with the highest consumption on record. We also identified a clear geospatial correlation between building areas and key socioeconomic variables, which indicates our global building map can serve as an important input to modeling global socioeconomic needs and drivers.

[91]  arXiv:2404.13921 [pdf, other]
Title: NeRF-DetS: Enhancing Multi-View 3D Object Detection with Sampling-adaptive Network of Continuous NeRF-based Representation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

As a preliminary work, NeRF-Det unifies the tasks of novel view synthesis and 3D perception, demonstrating that perceptual tasks can benefit from novel view synthesis methods like NeRF, significantly improving the performance of indoor multi-view 3D object detection. Using the geometry MLP of NeRF to direct the attention of detection head to crucial parts and incorporating self-supervised loss from novel view rendering contribute to the achieved improvement. To better leverage the notable advantages of the continuous representation through neural rendering in space, we introduce a novel 3D perception network structure, NeRF-DetS. The key component of NeRF-DetS is the Multi-level Sampling-Adaptive Network, making the sampling process adaptively from coarse to fine. Also, we propose a superior multi-view information fusion method, known as Multi-head Weighted Fusion. This fusion approach efficiently addresses the challenge of losing multi-view information when using arithmetic mean, while keeping low computational costs. NeRF-DetS outperforms competitive NeRF-Det on the ScanNetV2 dataset, by achieving +5.02% and +5.92% improvement in mAP@.25 and mAP@.50, respectively.

[92]  arXiv:2404.13923 [pdf, other]
Title: MaterialSeg3D: Segmenting Dense Materials from 2D Priors for 3D Assets
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Driven by powerful image diffusion models, recent research has achieved the automatic creation of 3D objects from textual or visual guidance. By performing score distillation sampling (SDS) iteratively across different views, these methods succeed in lifting 2D generative prior to the 3D space. However, such a 2D generative image prior bakes the effect of illumination and shadow into the texture. As a result, material maps optimized by SDS inevitably involve spurious correlated components. The absence of precise material definition makes it infeasible to relight the generated assets reasonably in novel scenes, which limits their application in downstream scenarios. In contrast, humans can effortlessly circumvent this ambiguity by deducing the material of the object from its appearance and semantics. Motivated by this insight, we propose MaterialSeg3D, a 3D asset material generation framework to infer underlying material from the 2D semantic prior. Based on such a prior model, we devise a mechanism to parse material in 3D space. We maintain a UV stack, each map of which is unprojected from a specific viewpoint. After traversing all viewpoints, we fuse the stack through a weighted voting scheme and then employ region unification to ensure the coherence of the object parts. To fuel the learning of semantics prior, we collect a material dataset, named Materialized Individual Objects (MIO), which features abundant images, diverse categories, and accurate annotations. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method.

[93]  arXiv:2404.13944 [pdf, other]
Title: Gorgeous: Create Your Desired Character Facial Makeup from Any Ideas
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Contemporary makeup transfer methods primarily focus on replicating makeup from one face to another, considerably limiting their use in creating diverse and creative character makeup essential for visual storytelling. Such methods typically fail to address the need for uniqueness and contextual relevance, specifically aligning with character and story settings as they depend heavily on existing facial makeup in reference images. This approach also presents a significant challenge when attempting to source a perfectly matched facial makeup style, further complicating the creation of makeup designs inspired by various story elements, such as theme, background, and props that do not necessarily feature faces. To address these limitations, we introduce $Gorgeous$, a novel diffusion-based makeup application method that goes beyond simple transfer by innovatively crafting unique and thematic facial makeup. Unlike traditional methods, $Gorgeous$ does not require the presence of a face in the reference images. Instead, it draws artistic inspiration from a minimal set of three to five images, which can be of any type, and transforms these elements into practical makeup applications directly on the face. Our comprehensive experiments demonstrate that $Gorgeous$ can effectively generate distinctive character facial makeup inspired by the chosen thematic reference images. This approach opens up new possibilities for integrating broader story elements into character makeup, thereby enhancing the narrative depth and visual impact in storytelling.

[94]  arXiv:2404.13947 [pdf, other]
Title: Boter: Bootstrapping Knowledge Selection and Question Answering for Knowledge-based VQA
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Knowledge-based Visual Question Answering (VQA) requires models to incorporate external knowledge to respond to questions about visual content. Previous methods mostly follow the "retrieve and generate" paradigm. Initially, they utilize a pre-trained retriever to fetch relevant knowledge documents, subsequently employing them to generate answers. While these methods have demonstrated commendable performance in the task, they possess limitations: (1) they employ an independent retriever to acquire knowledge solely based on the similarity between the query and knowledge embeddings, without assessing whether the knowledge document is truly conducive to helping answer the question; (2) they convert the image into text and then conduct retrieval and answering in natural language space, which may not ensure comprehensive acquisition of all image information. To address these limitations, we propose Boter, a novel framework designed to bootstrap knowledge selection and question answering by leveraging the robust multimodal perception capabilities of the Multimodal Large Language Model (MLLM). The framework consists of two modules: Selector and Answerer, where both are initialized by the MLLM and parameter-efficiently finetuned in a simple cycle: find key knowledge in the retrieved knowledge documents using the Selector, and then use them to finetune the Answerer to predict answers; obtain the pseudo-labels of key knowledge documents based on the predictions of the Answerer and weak supervision labels, and then finetune the Selector to select key knowledge; repeat. Our framework significantly enhances the performance of the baseline on the challenging open-domain Knowledge-based VQA benchmark, OK-VQA, achieving a state-of-the-art accuracy of 62.83%.

[95]  arXiv:2404.13949 [pdf, other]
Title: PeLiCal: Targetless Extrinsic Calibration via Penetrating Lines for RGB-D Cameras with Limited Co-visibility
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

RGB-D cameras are crucial in robotic perception, given their ability to produce images augmented with depth data. However, their limited FOV often requires multiple cameras to cover a broader area. In multi-camera RGB-D setups, the goal is typically to reduce camera overlap, optimizing spatial coverage with as few cameras as possible. The extrinsic calibration of these systems introduces additional complexities. Existing methods for extrinsic calibration either necessitate specific tools or highly depend on the accuracy of camera motion estimation. To address these issues, we present PeLiCal, a novel line-based calibration approach for RGB-D camera systems exhibiting limited overlap. Our method leverages long line features from surroundings, and filters out outliers with a novel convergence voting algorithm, achieving targetless, real-time, and outlier-robust performance compared to existing methods. We open source our implementation on \url{https://github.com/joomeok/PeLiCal.git}.

[96]  arXiv:2404.13953 [pdf, other]
Title: 360VOTS: Visual Object Tracking and Segmentation in Omnidirectional Videos
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Visual object tracking and segmentation in omnidirectional videos are challenging due to the wide field-of-view and large spherical distortion brought by 360{\deg} images. To alleviate these problems, we introduce a novel representation, extended bounding field-of-view (eBFoV), for target localization and use it as the foundation of a general 360 tracking framework which is applicable for both omnidirectional visual object tracking and segmentation tasks. Building upon our previous work on omnidirectional visual object tracking (360VOT), we propose a comprehensive dataset and benchmark that incorporates a new component called omnidirectional video object segmentation (360VOS). The 360VOS dataset includes 290 sequences accompanied by dense pixel-wise masks and covers a broader range of target categories. To support both the development and evaluation of algorithms in this domain, we divide the dataset into a training subset with 170 sequences and a testing subset with 120 sequences. Furthermore, we tailor evaluation metrics for both omnidirectional tracking and segmentation to ensure rigorous assessment. Through extensive experiments, we benchmark state-of-the-art approaches and demonstrate the effectiveness of our proposed 360 tracking framework and training dataset. Homepage: https://360vots.hkustvgd.com/

[97]  arXiv:2404.13972 [pdf, other]
Title: Non-Uniform Exposure Imaging via Neuromorphic Shutter Control
Subjects: Computer Vision and Pattern Recognition (cs.CV)

By leveraging the blur-noise trade-off, imaging with non-uniform exposures largely extends the image acquisition flexibility in harsh environments. However, the limitation of conventional cameras in perceiving intra-frame dynamic information prevents existing methods from being implemented in the real-world frame acquisition for real-time adaptive camera shutter control. To address this challenge, we propose a novel Neuromorphic Shutter Control (NSC) system to avoid motion blurs and alleviate instant noises, where the extremely low latency of events is leveraged to monitor the real-time motion and facilitate the scene-adaptive exposure. Furthermore, to stabilize the inconsistent Signal-to-Noise Ratio (SNR) caused by the non-uniform exposure times, we propose an event-based image denoising network within a self-supervised learning paradigm, i.e., SEID, exploring the statistics of image noises and inter-frame motion information of events to obtain artificial supervision signals for high-quality imaging in real-world scenes. To illustrate the effectiveness of the proposed NSC, we implement it in hardware by building a hybrid-camera imaging prototype system, with which we collect a real-world dataset containing well-synchronized frames and events in diverse scenarios with different target scenes and motion patterns. Experiments on the synthetic and real-world datasets demonstrate the superiority of our method over state-of-the-art approaches.

[98]  arXiv:2404.13983 [pdf, other]
Title: Structure-Aware Human Body Reshaping with Adaptive Affinity-Graph Network
Comments: 11 pages;
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Given a source portrait, the automatic human body reshaping task aims at editing it to an aesthetic body shape. As the technology has been widely used in media, several methods have been proposed mainly focusing on generating optical flow to warp the body shape. However, those previous works only consider the local transformation of different body parts (arms, torso, and legs), ignoring the global affinity, and limiting the capacity to ensure consistency and quality across the entire body. In this paper, we propose a novel Adaptive Affinity-Graph Network (AAGN), which extracts the global affinity between different body parts to enhance the quality of the generated optical flow. Specifically, our AAGN primarily introduces the following designs: (1) we propose an Adaptive Affinity-Graph (AAG) Block that leverages the characteristic of a fully connected graph. AAG represents different body parts as nodes in an adaptive fully connected graph and captures all the affinities between nodes to obtain a global affinity map. The design could better improve the consistency between body parts. (2) Besides, for high-frequency details are crucial for photo aesthetics, a Body Shape Discriminator (BSD) is designed to extract information from both high-frequency and spatial domain. Particularly, an SRM filter is utilized to extract high-frequency details, which are combined with spatial features as input to the BSD. With this design, BSD guides the Flow Generator (FG) to pay attention to various fine details rather than rigid pixel-level fitting. Extensive experiments conducted on the BR-5K dataset demonstrate that our framework significantly enhances the aesthetic appeal of reshaped photos, marginally surpassing all previous work to achieve state-of-the-art in all evaluation metrics.

[99]  arXiv:2404.13984 [pdf, other]
Title: RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Although diffusion models can generate high-quality human images, their applications are limited by the instability in generating hands with correct structures. Some previous works mitigate the problem by considering hand structure yet struggle to maintain style consistency between refined malformed hands and other image regions. In this paper, we aim to solve the problem of inconsistency regarding hand structure and style. We propose a conditional diffusion-based framework RHanDS to refine the hand region with the help of decoupled structure and style guidance. Specifically, the structure guidance is the hand mesh reconstructed from the malformed hand, serving to correct the hand structure. The style guidance is a hand image, e.g., the malformed hand itself, and is employed to furnish the style reference for hand refining. In order to suppress the structure leakage when referencing hand style and effectively utilize hand data to improve the capability of the model, we build a multi-style hand dataset and introduce a twostage training strategy. In the first stage, we use paired hand images for training to generate hands with the same style as the reference. In the second stage, various hand images generated based on the human mesh are used for training to enable the model to gain control over the hand structure. We evaluate our method and counterparts on the test dataset of the proposed multi-style hand dataset. The experimental results show that RHanDS can effectively refine hands structure- and style- correctly compared with previous methods. The codes and datasets will be available soon.

[100]  arXiv:2404.13992 [pdf, other]
Title: Dynamic Proxy Domain Generalizes the Crowd Localization by Better Binary Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Crowd localization targets on predicting each instance precise location within an image. Current advanced methods propose the pixel-wise binary classification to tackle the congested prediction, in which the pixel-level thresholds binarize the prediction confidence of being the pedestrian head. Since the crowd scenes suffer from extremely varying contents, counts and scales, the confidence-threshold learner is fragile and under-generalized encountering domain knowledge shift. Moreover, at the most time, the target domain is agnostic in training. Hence, it is imperative to exploit how to enhance the generalization of confidence-threshold locator to the latent target domain. In this paper, we propose a Dynamic Proxy Domain (DPD) method to generalize the learner under domain shift. Concretely, based on the theoretical analysis to the generalization error risk upper bound on the latent target domain to a binary classifier, we propose to introduce a generated proxy domain to facilitate generalization. Then, based on the theory, we design a DPD algorithm which is composed by a training paradigm and proxy domain generator to enhance the domain generalization of the confidence-threshold learner. Besides, we conduct our method on five kinds of domain shift scenarios, demonstrating the effectiveness on generalizing the crowd localization. Our code will be available at https://github.com/zhangda1018/DPD.

[101]  arXiv:2404.13996 [pdf, other]
Title: Challenges in automatic and selective plant-clearing
Subjects: Computer Vision and Pattern Recognition (cs.CV)

With the advent of multispectral imagery and AI, there have been numerous works on automatic plant segmentation for purposes such as counting, picking, health monitoring, localized pesticide delivery, etc. In this paper, we tackle the related problem of automatic and selective plant-clearing in a sustainable forestry context, where an autonomous machine has to detect and avoid specific plants while clearing any weeds which may compete with the species being cultivated. Such an autonomous system requires a high level of robustness to weather conditions, plant variability, terrain and weeds while remaining cheap and easy to maintain. We notably discuss the lack of robustness of spectral imagery, investigate the impact of the reference database's size and discuss issues specific to AI systems operating in uncontrolled environments.

[102]  arXiv:2404.13999 [pdf, other]
Title: CoFInAl: Enhancing Action Quality Assessment with Coarse-to-Fine Instruction Alignment
Comments: Accepted by IJCAI 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Action Quality Assessment (AQA) is pivotal for quantifying actions across domains like sports and medical care. Existing methods often rely on pre-trained backbones from large-scale action recognition datasets to boost performance on smaller AQA datasets. However, this common strategy yields suboptimal results due to the inherent struggle of these backbones to capture the subtle cues essential for AQA. Moreover, fine-tuning on smaller datasets risks overfitting. To address these issues, we propose Coarse-to-Fine Instruction Alignment (CoFInAl). Inspired by recent advances in large language model tuning, CoFInAl aligns AQA with broader pre-trained tasks by reformulating it as a coarse-to-fine classification task. Initially, it learns grade prototypes for coarse assessment and then utilizes fixed sub-grade prototypes for fine-grained assessment. This hierarchical approach mirrors the judging process, enhancing interpretability within the AQA framework. Experimental results on two long-term AQA datasets demonstrate CoFInAl achieves state-of-the-art performance with significant correlation gains of 5.49% and 3.55% on Rhythmic Gymnastics and Fis-V, respectively. Our code is available at https://github.com/ZhouKanglei/CoFInAl_AQA.

[103]  arXiv:2404.14007 [pdf, other]
Title: Infusion: Preventing Customized Text-to-Image Diffusion from Overfitting
Comments: 10 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Text-to-image (T2I) customization aims to create images that embody specific visual concepts delineated in textual descriptions. However, existing works still face a main challenge, concept overfitting. To tackle this challenge, we first analyze overfitting, categorizing it into concept-agnostic overfitting, which undermines non-customized concept knowledge, and concept-specific overfitting, which is confined to customize on limited modalities, i.e, backgrounds, layouts, styles. To evaluate the overfitting degree, we further introduce two metrics, i.e, Latent Fisher divergence and Wasserstein metric to measure the distribution changes of non-customized and customized concept respectively. Drawing from the analysis, we propose Infusion, a T2I customization method that enables the learning of target concepts to avoid being constrained by limited training modalities, while preserving non-customized knowledge. Remarkably, Infusion achieves this feat with remarkable efficiency, requiring a mere 11KB of trained parameters. Extensive experiments also demonstrate that our approach outperforms state-of-the-art methods in both single and multi-concept customized generation.

[104]  arXiv:2404.14019 [pdf, ps, other]
Title: A Multimodal Feature Distillation with CNN-Transformer Network for Brain Tumor Segmentation with Incomplete Modalities
Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Applications (stat.AP)

Existing brain tumor segmentation methods usually utilize multiple Magnetic Resonance Imaging (MRI) modalities in brain tumor images for segmentation, which can achieve better segmentation performance. However, in clinical applications, some modalities are missing due to resource constraints, leading to severe degradation in the performance of methods applying complete modality segmentation. In this paper, we propose a Multimodal feature distillation with Convolutional Neural Network (CNN)-Transformer hybrid network (MCTSeg) for accurate brain tumor segmentation with missing modalities. We first design a Multimodal Feature Distillation (MFD) module to distill feature-level multimodal knowledge into different unimodality to extract complete modality information. We further develop a Unimodal Feature Enhancement (UFE) module to model the relationship between global and local information semantically. Finally, we build a Cross-Modal Fusion (CMF) module to explicitly align the global correlations among different modalities even when some modalities are missing. Complementary features within and across different modalities are refined via the CNN-Transformer hybrid architectures in both the UFE and CMF modules, where local and global dependencies are both captured. Our ablation study demonstrates the importance of the proposed modules with CNN-Transformer networks and the convolutional blocks in Transformer for improving the performance of brain tumor segmentation with missing modalities. Extensive experiments on the BraTS2018 and BraTS2020 datasets show that the proposed MCTSeg framework outperforms the state-of-the-art methods in missing modalities cases. Our code is available at: https://github.com/mkang315/MCTSeg.

[105]  arXiv:2404.14022 [pdf, other]
Title: Collaborative Perception Datasets in Autonomous Driving: A Survey
Comments: 8 pages,3 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

This survey offers a comprehensive examination of collaborative perception datasets in the context of Vehicle-to-Infrastructure (V2I), Vehicle-to-Vehicle (V2V), and Vehicle-to-Everything (V2X). It highlights the latest developments in large-scale benchmarks that accelerate advancements in perception tasks for autonomous vehicles. The paper systematically analyzes a variety of datasets, comparing them based on aspects such as diversity, sensor setup, quality, public availability, and their applicability to downstream tasks. It also highlights the key challenges such as domain shift, sensor setup limitations, and gaps in dataset diversity and availability. The importance of addressing privacy and security concerns in the development of datasets is emphasized, regarding data sharing and dataset creation. The conclusion underscores the necessity for comprehensive, globally accessible datasets and collaborative efforts from both technological and research communities to overcome these challenges and fully harness the potential of autonomous driving.

[106]  arXiv:2404.14025 [pdf, other]
Title: DHRNet: A Dual-Path Hierarchical Relation Network for Multi-Person Pose Estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Multi-person pose estimation (MPPE) presents a formidable yet crucial challenge in computer vision. Most existing methods predominantly concentrate on isolated interaction either between instances or joints, which is inadequate for scenarios demanding concurrent localization of both instances and joints. This paper introduces a novel CNN-based single-stage method, named Dual-path Hierarchical Relation Network (DHRNet), to extract instance-to-joint and joint-to-instance interactions concurrently. Specifically, we design a dual-path interaction modeling module (DIM) that strategically organizes cross-instance and cross-joint interaction modeling modules in two complementary orders, enriching interaction information by integrating merits from different correlation modeling branches. Notably, DHRNet excels in joint localization by leveraging information from other instances and joints. Extensive evaluations on challenging datasets, including COCO, CrowdPose, and OCHuman datasets, showcase DHRNet's state-of-the-art performance. The code will be released at https://github.com/YHDang/dhrnet-multi-pose-estimation.

[107]  arXiv:2404.14027 [pdf, other]
Title: OccFeat: Self-supervised Occupancy Feature Prediction for Pretraining BEV Segmentation Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

We introduce a self-supervised pretraining method, called OcFeat, for camera-only Bird's-Eye-View (BEV) segmentation networks. With OccFeat, we pretrain a BEV network via occupancy prediction and feature distillation tasks. Occupancy prediction provides a 3D geometric understanding of the scene to the model. However, the geometry learned is class-agnostic. Hence, we add semantic information to the model in the 3D space through distillation from a self-supervised pretrained image foundation model. Models pretrained with our method exhibit improved BEV semantic segmentation performance, particularly in low-data scenarios. Moreover, empirical results affirm the efficacy of integrating feature distillation with 3D occupancy prediction in our pretraining approach.

[108]  arXiv:2404.14032 [pdf, other]
Title: 1st Place Solution to the 1st SkatingVerse Challenge
Comments: 3 pages, 1st SkatingVerse Challenge, 18th IEEE International Conference on Automatic Face and Gesture Recognition workshop
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper presents the winning solution for the 1st SkatingVerse Challenge. We propose a method that involves several steps. To begin, we leverage the DINO framework to extract the Region of Interest (ROI) and perform precise cropping of the raw video footage. Subsequently, we employ three distinct models, namely Unmasked Teacher, UniformerV2, and InfoGCN, to capture different aspects of the data. By ensembling the prediction results based on logits, our solution attains an impressive leaderboard score of 95.73%.

[109]  arXiv:2404.14034 [pdf, other]
Title: PointDifformer: Robust Point Cloud Registration With Neural Diffusion and Transformer
Comments: Accepted by IEEE Transactions on Geoscience and Remote Sensing
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Point cloud registration is a fundamental technique in 3-D computer vision with applications in graphics, autonomous driving, and robotics. However, registration tasks under challenging conditions, under which noise or perturbations are prevalent, can be difficult. We propose a robust point cloud registration approach that leverages graph neural partial differential equations (PDEs) and heat kernel signatures. Our method first uses graph neural PDE modules to extract high dimensional features from point clouds by aggregating information from the 3-D point neighborhood, thereby enhancing the robustness of the feature representations. Then, we incorporate heat kernel signatures into an attention mechanism to efficiently obtain corresponding keypoints. Finally, a singular value decomposition (SVD) module with learnable weights is used to predict the transformation between two point clouds. Empirical experiments on a 3-D point cloud dataset demonstrate that our approach not only achieves state-of-the-art performance for point cloud registration but also exhibits better robustness to additive noise or 3-D shape perturbations.

[110]  arXiv:2404.14037 [pdf, other]
Title: GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Recent works on audio-driven talking head synthesis using Neural Radiance Fields (NeRF) have achieved impressive results. However, due to inadequate pose and expression control caused by NeRF implicit representation, these methods still have some limitations, such as unsynchronized or unnatural lip movements, and visual jitter and artifacts. In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting. With the explicit representation property of 3D Gaussians, intuitive control of the facial motion is achieved by binding Gaussians to 3D facial models. GaussianTalker consists of two modules, Speaker-specific Motion Translator and Dynamic Gaussian Renderer. Speaker-specific Motion Translator achieves accurate lip movements specific to the target speaker through universalized audio feature extraction and customized lip motion generation. Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose, delivering stable and realistic rendered videos. Extensive experimental results suggest that GaussianTalker outperforms existing state-of-the-art methods in talking head synthesis, delivering precise lip synchronization and exceptional visual quality. Our method achieves rendering speeds of 130 FPS on NVIDIA RTX4090 GPU, significantly exceeding the threshold for real-time rendering performance, and can potentially be deployed on other hardware platforms.

[111]  arXiv:2404.14040 [pdf, other]
Title: Surgical-DeSAM: Decoupling SAM for Instrument Segmentation in Robotic Surgery
Comments: 8 pages, 2 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Purpose: The recent Segment Anything Model (SAM) has demonstrated impressive performance with point, text or bounding box prompts, in various applications. However, in safety-critical surgical tasks, prompting is not possible due to (i) the lack of per-frame prompts for supervised learning, (ii) it is unrealistic to prompt frame-by-frame in a real-time tracking application, and (iii) it is expensive to annotate prompts for offline applications.
Methods: We develop Surgical-DeSAM to generate automatic bounding box prompts for decoupling SAM to obtain instrument segmentation in real-time robotic surgery. We utilise a commonly used detection architecture, DETR, and fine-tuned it to obtain bounding box prompt for the instruments. We then empolyed decoupling SAM (DeSAM) by replacing the image encoder with DETR encoder and fine-tune prompt encoder and mask decoder to obtain instance segmentation for the surgical instruments. To improve detection performance, we adopted the Swin-transformer to better feature representation.
Results: The proposed method has been validated on two publicly available datasets from the MICCAI surgical instruments segmentation challenge EndoVis 2017 and 2018. The performance of our method is also compared with SOTA instrument segmentation methods and demonstrated significant improvements with dice metrics of 89.62 and 90.70 for the EndoVis 2017 and 2018.
Conclusion: Our extensive experiments and validations demonstrate that Surgical-DeSAM enables real-time instrument segmentation without any additional prompting and outperforms other SOTA segmentation methods.

[112]  arXiv:2404.14042 [pdf, other]
Title: CloudFort: Enhancing Robustness of 3D Point Cloud Classification Against Backdoor Attacks via Spatial Partitioning and Ensemble Prediction
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The increasing adoption of 3D point cloud data in various applications, such as autonomous vehicles, robotics, and virtual reality, has brought about significant advancements in object recognition and scene understanding. However, this progress is accompanied by new security challenges, particularly in the form of backdoor attacks. These attacks involve inserting malicious information into the training data of machine learning models, potentially compromising the model's behavior. In this paper, we propose CloudFort, a novel defense mechanism designed to enhance the robustness of 3D point cloud classifiers against backdoor attacks. CloudFort leverages spatial partitioning and ensemble prediction techniques to effectively mitigate the impact of backdoor triggers while preserving the model's performance on clean data. We evaluate the effectiveness of CloudFort through extensive experiments, demonstrating its strong resilience against the Point Cloud Backdoor Attack (PCBA). Our results show that CloudFort significantly enhances the security of 3D point cloud classification models without compromising their accuracy on benign samples. Furthermore, we explore the limitations of CloudFort and discuss potential avenues for future research in the field of 3D point cloud security. The proposed defense mechanism represents a significant step towards ensuring the trustworthiness and reliability of point-cloud-based systems in real-world applications.

[113]  arXiv:2404.14044 [pdf, other]
Title: HashPoint: Accelerated Point Searching and Sampling for Neural Rendering
Comments: CVPR2024 Highlight
Journal-ref: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we address the problem of efficient point searching and sampling for volume neural rendering. Within this realm, two typical approaches are employed: rasterization and ray tracing. The rasterization-based methods enable real-time rendering at the cost of increased memory and lower fidelity. In contrast, the ray-tracing-based methods yield superior quality but demand longer rendering time. We solve this problem by our HashPoint method combining these two strategies, leveraging rasterization for efficient point searching and sampling, and ray marching for rendering. Our method optimizes point searching by rasterizing points within the camera's view, organizing them in a hash table, and facilitating rapid searches. Notably, we accelerate the rendering process by adaptive sampling on the primary surface encountered by the ray. Our approach yields substantial speed-up for a range of state-of-the-art ray-tracing-based methods, maintaining equivalent or superior accuracy across synthetic and real test datasets. The code will be available at https://jiahao-ma.github.io/hashpoint/.

[114]  arXiv:2404.14055 [pdf, other]
Title: RingID: Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification
Comments: 25 pages, 8 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We revisit Tree-Ring Watermarking, a recent diffusion model watermarking method that demonstrates great robustness to various attacks. We conduct an in-depth study on it and reveal that the distribution shift unintentionally introduced by the watermarking process, apart from watermark pattern matching, contributes to its exceptional robustness. Our investigation further exposes inherent flaws in its original design, particularly in its ability to identify multiple distinct keys, where distribution shift offers no assistance. Based on these findings and analysis, we present RingID for enhanced multi-key identification. It consists of a novel multi-channel heterogeneous watermarking approach designed to seamlessly amalgamate distinctive advantages from diverse watermarks. Coupled with a series of suggested enhancements, RingID exhibits substantial advancements in multi-key identification.

[115]  arXiv:2404.14062 [pdf, other]
Title: GatedLexiconNet: A Comprehensive End-to-End Handwritten Paragraph Text Recognition System
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

The Handwritten Text Recognition problem has been a challenge for researchers for the last few decades, especially in the domain of computer vision, a subdomain of pattern recognition. Variability of texts amongst writers, cursiveness, and different font styles of handwritten texts with degradation of historical text images make it a challenging problem. Recognizing scanned document images in neural network-based systems typically involves a two-step approach: segmentation and recognition. However, this method has several drawbacks. These shortcomings encompass challenges in identifying text regions, analyzing layout diversity within pages, and establishing accurate ground truth segmentation. Consequently, these processes are prone to errors, leading to bottlenecks in achieving high recognition accuracies. Thus, in this study, we present an end-to-end paragraph recognition system that incorporates internal line segmentation and gated convolutional layers based encoder. The gating is a mechanism that controls the flow of information and allows to adaptively selection of the more relevant features in handwritten text recognition models. The attention module plays an important role in performing internal line segmentation, allowing the page to be processed line-by-line. During the decoding step, we have integrated a connectionist temporal classification-based word beam search decoder as a post-processing step. In this work, we have extended existing LexiconNet by carefully applying and utilizing gated convolutional layers in the existing deep neural network. Our results at line and page levels also favour our new GatedLexiconNet. This study reported character error rates of 2.27% on IAM, 0.9% on RIMES, and 2.13% on READ-16, and word error rates of 5.73% on IAM, 2.76% on RIMES, and 6.52% on READ-2016 datasets.

[116]  arXiv:2404.14066 [pdf, other]
Title: SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

The user base of short video apps has experienced unprecedented growth in recent years, resulting in a significant demand for video content analysis. In particular, text-video retrieval, which aims to find the top matching videos given text descriptions from a vast video corpus, is an essential function, the primary challenge of which is to bridge the modality gap. Nevertheless, most existing approaches treat texts merely as discrete tokens and neglect their syntax structures. Moreover, the abundant spatial and temporal clues in videos are often underutilized due to the lack of interaction with text. To address these issues, we argue that using texts as guidance to focus on relevant temporal frames and spatial regions within videos is beneficial. In this paper, we propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net) that exploits the inherent semantic and syntax hierarchy of texts to bridge the modality gap from two perspectives. First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions, to guide the visual representations. Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation. We evaluated our method on four public text-video retrieval datasets of MSR-VTT, MSVD, DiDeMo, and ActivityNet. The experimental results and ablation studies confirm the advantages of our proposed method.

[117]  arXiv:2404.14099 [pdf, other]
Title: DynaMMo: Dynamic Model Merging for Efficient Class Incremental Learning for Medical Images
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Continual learning, the ability to acquire knowledge from new data while retaining previously learned information, is a fundamental challenge in machine learning. Various approaches, including memory replay, knowledge distillation, model regularization, and dynamic network expansion, have been proposed to address this issue. Thus far, dynamic network expansion methods have achieved state-of-the-art performance at the cost of incurring significant computational overhead. This is due to the need for additional model buffers, which makes it less feasible in resource-constrained settings, particularly in the medical domain. To overcome this challenge, we propose Dynamic Model Merging, DynaMMo, a method that merges multiple networks at different stages of model training to achieve better computational efficiency. Specifically, we employ lightweight learnable modules for each task and combine them into a unified model to minimize computational overhead. DynaMMo achieves this without compromising performance, offering a cost-effective solution for continual learning in medical applications. We evaluate DynaMMo on three publicly available datasets, demonstrating its effectiveness compared to existing approaches. DynaMMo offers around 10-fold reduction in GFLOPS with a small drop of 2.76 in average accuracy when compared to state-of-the-art dynamic-based approaches. The code implementation of this work will be available upon the acceptance of this work at https://github.com/BioMedIA-MBZUAI/DynaMMo.

[118]  arXiv:2404.14109 [pdf, other]
Title: CKD: Contrastive Knowledge Distillation from A Sample-wise Perspective
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we present a simple yet effective contrastive knowledge distillation approach, which can be formulated as a sample-wise alignment problem with intra- and inter-sample constraints. Unlike traditional knowledge distillation methods that concentrate on maximizing feature similarities or preserving class-wise semantic correlations between teacher and student features, our method attempts to recover the "dark knowledge" by aligning sample-wise teacher and student logits. Specifically, our method first minimizes logit differences within the same sample by considering their numerical values, thus preserving intra-sample similarities. Next, we bridge semantic disparities by leveraging dissimilarities across different samples. Note that constraints on intra-sample similarities and inter-sample dissimilarities can be efficiently and effectively reformulated into a contrastive learning framework with newly designed positive and negative pairs. The positive pair consists of the teacher's and student's logits derived from an identical sample, while the negative pairs are formed by using logits from different samples. With this formulation, our method benefits from the simplicity and efficiency of contrastive learning through the optimization of InfoNCE, yielding a run-time complexity that is far less than $O(n^2)$, where $n$ represents the total number of training samples. Furthermore, our method can eliminate the need for hyperparameter tuning, particularly related to temperature parameters and large batch sizes. We conduct comprehensive experiments on three datasets including CIFAR-100, ImageNet-1K, and MS COCO. Experimental results clearly confirm the effectiveness of the proposed method on both image classification and object detection tasks. Our source codes will be publicly available at https://github.com/wencheng-zhu/CKD.

[119]  arXiv:2404.14132 [pdf, other]
Title: CRNet: A Detail-Preserving Network for Unified Image Restoration and Enhancement Task
Comments: This paper is accepted by CVPR2024 Workshop, Code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

In real-world scenarios, images captured often suffer from blurring, noise, and other forms of image degradation, and due to sensor limitations, people usually can only obtain low dynamic range images. To achieve high-quality images, researchers have attempted various image restoration and enhancement operations on photographs, including denoising, deblurring, and high dynamic range imaging. However, merely performing a single type of image enhancement still cannot yield satisfactory images. In this paper, to deal with the challenge above, we propose the Composite Refinement Network (CRNet) to address this issue using multiple exposure images. By fully integrating information-rich multiple exposure inputs, CRNet can perform unified image restoration and enhancement. To improve the quality of image details, CRNet explicitly separates and strengthens high and low-frequency information through pooling layers, using specially designed Multi-Branch Blocks for effective fusion of these frequencies. To increase the receptive field and fully integrate input features, CRNet employs the High-Frequency Enhancement Module, which includes large kernel convolutions and an inverted bottleneck ConvFFN. Our model secured third place in the first track of the Bracketing Image Restoration and Enhancement Challenge, surpassing previous SOTA models in both testing metrics and visual quality.

[120]  arXiv:2404.14135 [pdf, other]
Title: Text in the Dark: Extremely Low-Light Text Image Enhancement
Comments: The first two authors contributed equally to this work
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Extremely low-light text images are common in natural scenes, making scene text detection and recognition challenging. One solution is to enhance these images using low-light image enhancement methods before text extraction. However, previous methods often do not try to particularly address the significance of low-level features, which are crucial for optimal performance on downstream scene text tasks. Further research is also hindered by the lack of extremely low-light text datasets. To address these limitations, we propose a novel encoder-decoder framework with an edge-aware attention module to focus on scene text regions during enhancement. Our proposed method uses novel text detection and edge reconstruction losses to emphasize low-level scene text features, leading to successful text extraction. Additionally, we present a Supervised Deep Curve Estimation (Supervised-DCE) model to synthesize extremely low-light images based on publicly available scene text datasets such as ICDAR15 (IC15). We also labeled texts in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets to allow for objective assessment of extremely low-light image enhancement through scene text tasks. Extensive experiments show that our model outperforms state-of-the-art methods in terms of both image quality and scene text metrics on the widely-used LOL, SID, and synthetic IC15 datasets. Code and dataset will be released publicly at https://github.com/chunchet-ng/Text-in-the-Dark.

[121]  arXiv:2404.14162 [pdf, other]
Title: FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on
Comments: Accepted by IJCAI 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Despite their impressive generative performance, latent diffusion model-based virtual try-on (VTON) methods lack faithfulness to crucial details of the clothes, such as style, pattern, and text. To alleviate these issues caused by the diffusion stochastic nature and latent supervision, we propose a novel Faithful Latent Diffusion Model for VTON, termed FLDM-VTON. FLDM-VTON improves the conventional latent diffusion process in three major aspects. First, we propose incorporating warped clothes as both the starting point and local condition, supplying the model with faithful clothes priors. Second, we introduce a novel clothes flattening network to constrain generated try-on images, providing clothes-consistent faithful supervision. Third, we devise a clothes-posterior sampling for faithful inference, further enhancing the model performance over conventional clothes-agnostic Gaussian sampling. Extensive experimental results on the benchmark VITON-HD and Dress Code datasets demonstrate that our FLDM-VTON outperforms state-of-the-art baselines and is able to generate photo-realistic try-on images with faithful clothing details.

[122]  arXiv:2404.14177 [pdf, other]
Title: Face2Face: Label-driven Facial Retouching Restoration
Subjects: Computer Vision and Pattern Recognition (cs.CV)

With the popularity of social media platforms such as Instagram and TikTok, and the widespread availability and convenience of retouching tools, an increasing number of individuals are utilizing these tools to beautify their facial photographs. This poses challenges for fields that place high demands on the authenticity of photographs, such as identity verification and social media. By altering facial images, users can easily create deceptive images, leading to the dissemination of false information. This may pose challenges to the reliability of identity verification systems and social media, and even lead to online fraud. To address this issue, some work has proposed makeup removal methods, but they still lack the ability to restore images involving geometric deformations caused by retouching. To tackle the problem of facial retouching restoration, we propose a framework, dubbed Face2Face, which consists of three components: a facial retouching detector, an image restoration model named FaceR, and a color correction module called Hierarchical Adaptive Instance Normalization (H-AdaIN). Firstly, the facial retouching detector predicts a retouching label containing three integers, indicating the retouching methods and their corresponding degrees. Then FaceR restores the retouched image based on the predicted retouching label. Finally, H-AdaIN is applied to address the issue of color shift arising from diffusion models. Extensive experiments demonstrate the effectiveness of our framework and each module.

[123]  arXiv:2404.14198 [pdf, ps, other]
Title: BCFPL: Binary classification ConvNet based Fast Parking space recognition with Low resolution image
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

The automobile plays an important role in the economic activities of mankind, especially in the metropolis. Under the circumstances, the demand of quick search for available parking spaces has become a major concern for the automobile drivers. Meanwhile, the public sense of privacy is also awaking, the image-based parking space recognition methods lack the attention of privacy protection. In this paper, we proposed a binary convolutional neural network with lightweight design structure named BCFPL, which can be used to train with low-resolution parking space images and offer a reasonable recognition result. The images of parking space were collected from various complex environments, including different weather, occlusion conditions, and various camera angles. We conducted the training and testing progresses among different datasets and partial subsets. The experimental results show that the accuracy of BCFPL does not decrease compared with the original resolution image directly, and can reach the average level of the existing mainstream method. BCFPL also has low hardware requirements and fast recognition speed while meeting the privacy requirements, so it has application potential in intelligent city construction and automatic driving field.

[124]  arXiv:2404.14199 [pdf, other]
Title: Generalizable Neural Human Renderer
Subjects: Computer Vision and Pattern Recognition (cs.CV)

While recent advancements in animatable human rendering have achieved remarkable results, they require test-time optimization for each subject which can be a significant limitation for real-world applications. To address this, we tackle the challenging task of learning a Generalizable Neural Human Renderer (GNH), a novel method for rendering animatable humans from monocular video without any test-time optimization. Our core method focuses on transferring appearance information from the input video to the output image plane by utilizing explicit body priors and multi-view geometry. To render the subject in the intended pose, we utilize a straightforward CNN-based image renderer, foregoing the more common ray-sampling or rasterizing-based rendering modules. Our GNH achieves remarkable generalizable, photorealistic rendering with unseen subjects with a three-stage process. We quantitatively and qualitatively demonstrate that GNH significantly surpasses current state-of-the-art methods, notably achieving a 31.3% improvement in LPIPS.

[125]  arXiv:2404.14233 [pdf, other]
Title: Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

The rapidly developing Large Vision Language Models (LVLMs) have shown notable capabilities on a range of multi-modal tasks, but still face the hallucination phenomena where the generated texts do not align with the given contexts, significantly restricting the usages of LVLMs. Most previous work detects and mitigates hallucination at the coarse-grained level or requires expensive annotation (e.g., labeling by proprietary models or human experts). To address these issues, we propose detecting and mitigating hallucinations in LVLMs via fine-grained AI feedback. The basic idea is that we generate a small-size sentence-level hallucination annotation dataset by proprietary models, whereby we train a hallucination detection model which can perform sentence-level hallucination detection, covering primary hallucination types (i.e., object, attribute, and relationship). Then, we propose a detect-then-rewrite pipeline to automatically construct preference dataset for training hallucination mitigating model. Furthermore, we propose differentiating the severity of hallucinations, and introducing a Hallucination Severity-Aware Direct Preference Optimization (HSA-DPO) for mitigating hallucination in LVLMs by incorporating the severity of hallucinations into preference learning. Extensive experiments demonstrate the effectiveness of our method.

[126]  arXiv:2404.14239 [pdf, other]
Title: MultiBooth: Towards Generating All Your Concepts in an Image from Text
Comments: Project Page: this https URL . Github Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper introduces MultiBooth, a novel and efficient technique for multi-concept customization in image generation from text. Despite the significant advancements in customized generation methods, particularly with the success of diffusion models, existing methods often struggle with multi-concept scenarios due to low concept fidelity and high inference cost. MultiBooth addresses these issues by dividing the multi-concept generation process into two phases: a single-concept learning phase and a multi-concept integration phase. During the single-concept learning phase, we employ a multi-modal image encoder and an efficient concept encoding technique to learn a concise and discriminative representation for each concept. In the multi-concept integration phase, we use bounding boxes to define the generation area for each concept within the cross-attention map. This method enables the creation of individual concepts within their specified regions, thereby facilitating the formation of multi-concept images. This strategy not only improves concept fidelity but also reduces additional inference cost. MultiBooth surpasses various baselines in both qualitative and quantitative evaluations, showcasing its superior performance and computational efficiency. Project Page: https://multibooth.github.io/

[127]  arXiv:2404.14241 [pdf, other]
Title: UrbanCross: Enhancing Satellite Image-Text Retrieval with Cross-Domain Adaptation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Urbanization challenges underscore the necessity for effective satellite image-text retrieval methods to swiftly access specific information enriched with geographic semantics for urban applications. However, existing methods often overlook significant domain gaps across diverse urban landscapes, primarily focusing on enhancing retrieval performance within single domains. To tackle this issue, we present UrbanCross, a new framework for cross-domain satellite image-text retrieval. UrbanCross leverages a high-quality, cross-domain dataset enriched with extensive geo-tags from three countries to highlight domain diversity. It employs the Large Multimodal Model (LMM) for textual refinement and the Segment Anything Model (SAM) for visual augmentation, achieving a fine-grained alignment of images, segments and texts, yielding a 10% improvement in retrieval performance. Additionally, UrbanCross incorporates an adaptive curriculum-based source sampler and a weighted adversarial cross-domain fine-tuning module, progressively enhancing adaptability across various domains. Extensive experiments confirm UrbanCross's superior efficiency in retrieval and adaptation to new urban environments, demonstrating an average performance increase of 15% over its version without domain adaptation mechanisms, effectively bridging the domain gap.

[128]  arXiv:2404.14247 [pdf, other]
Title: From Modalities to Styles: Rethinking the Domain Gap in Heterogeneous Face Recognition
Comments: Accepted for publication in IEEE TBIOM
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Heterogeneous Face Recognition (HFR) focuses on matching faces from different domains, for instance, thermal to visible images, making Face Recognition (FR) systems more versatile for challenging scenarios. However, the domain gap between these domains and the limited large-scale datasets in the target HFR modalities make it challenging to develop robust HFR models from scratch. In our work, we view different modalities as distinct styles and propose a method to modulate feature maps of the target modality to address the domain gap. We present a new Conditional Adaptive Instance Modulation (CAIM ) module that seamlessly fits into existing FR networks, turning them into HFR-ready systems. The CAIM block modulates intermediate feature maps, efficiently adapting to the style of the source modality and bridging the domain gap. Our method enables end-to-end training using a small set of paired samples. We extensively evaluate the proposed approach on various challenging HFR benchmarks, showing that it outperforms state-of-the-art methods. The source code and protocols for reproducing the findings will be made publicly available

[129]  arXiv:2404.14248 [pdf, other]
Title: NTIRE 2024 Challenge on Low Light Image Enhancement: Methods and Results
Comments: NTIRE 2024 Challenge Report
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper reviews the NTIRE 2024 low light image enhancement challenge, highlighting the proposed solutions and results. The aim of this challenge is to discover an effective network design or solution capable of generating brighter, clearer, and visually appealing results when dealing with a variety of conditions, including ultra-high resolution (4K and beyond), non-uniform illumination, backlighting, extreme darkness, and night scenes. A notable total of 428 participants registered for the challenge, with 22 teams ultimately making valid submissions. This paper meticulously evaluates the state-of-the-art advancements in enhancing low-light images, reflecting the significant progress and creativity in this field.

[130]  arXiv:2404.14249 [pdf, other]
Title: CLIP-GS: CLIP-Informed Gaussian Splatting for Real-time and View-consistent 3D Semantic Understanding
Comments: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The recent 3D Gaussian Splatting (GS) exhibits high-quality and real-time synthesis of novel views in 3D scenes. Currently, it primarily focuses on geometry and appearance modeling, while lacking the semantic understanding of scenes. To bridge this gap, we present CLIP-GS, which integrates semantics from Contrastive Language-Image Pre-Training (CLIP) into Gaussian Splatting to efficiently comprehend 3D environments without annotated semantic data. In specific, rather than straightforwardly learning and rendering high-dimensional semantic features of 3D Gaussians, which significantly diminishes the efficiency, we propose a Semantic Attribute Compactness (SAC) approach. SAC exploits the inherent unified semantics within objects to learn compact yet effective semantic representations of 3D Gaussians, enabling highly efficient rendering (>100 FPS). Additionally, to address the semantic ambiguity, caused by utilizing view-inconsistent 2D CLIP semantics to supervise Gaussians, we introduce a 3D Coherent Self-training (3DCS) strategy, resorting to the multi-view consistency originated from the 3D model. 3DCS imposes cross-view semantic consistency constraints by leveraging refined, self-predicted pseudo-labels derived from the trained 3D Gaussian model, thereby enhancing precise and view-consistent segmentation results. Extensive experiments demonstrate that our method remarkably outperforms existing state-of-the-art approaches, achieving improvements of 17.29% and 20.81% in mIoU metric on Replica and ScanNet datasets, respectively, while maintaining real-time rendering speed. Furthermore, our approach exhibits superior performance even with sparse input data, verifying the robustness of our method.

[131]  arXiv:2404.14279 [pdf, other]
Title: Co-designing a Sub-millisecond Latency Event-based Eye Tracking System with Submanifold Sparse CNN
Comments: Accepted to CVPR 2024 workshop, AIS: Vision, Graphics, and AI for Streaming
Subjects: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)

Eye-tracking technology is integral to numerous consumer electronics applications, particularly in the realm of virtual and augmented reality (VR/AR). These applications demand solutions that excel in three crucial aspects: low-latency, low-power consumption, and precision. Yet, achieving optimal performance across all these fronts presents a formidable challenge, necessitating a balance between sophisticated algorithms and efficient backend hardware implementations. In this study, we tackle this challenge through a synergistic software/hardware co-design of the system with an event camera. Leveraging the inherent sparsity of event-based input data, we integrate a novel sparse FPGA dataflow accelerator customized for submanifold sparse convolution neural networks (SCNN). The SCNN implemented on the accelerator can efficiently extract the embedding feature vector from each representation of event slices by only processing the non-zero activations. Subsequently, these vectors undergo further processing by a gated recurrent unit (GRU) and a fully connected layer on the host CPU to generate the eye centers. Deployment and evaluation of our system reveal outstanding performance metrics. On the Event-based Eye-Tracking-AIS2024 dataset, our system achieves 81% p5 accuracy, 99.5% p10 accuracy, and 3.71 Mean Euclidean Distance with 0.7 ms latency while only consuming 2.29 mJ per inference. Notably, our solution opens up opportunities for future eye-tracking systems. Code is available at https://github.com/CASR-HKU/ESDA/tree/eye_tracking.

[132]  arXiv:2404.14280 [pdf, other]
Title: RESFM: Robust Equivariant Multiview Structure from Motion
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Multiview Structure from Motion is a fundamental and challenging computer vision problem. A recent deep-based approach was proposed utilizing matrix equivariant architectures for the simultaneous recovery of camera pose and 3D scene structure from large image collections. This work however made the unrealistic assumption that the point tracks given as input are clean of outliers. Here we propose an architecture suited to dealing with outliers by adding an inlier/outlier classifying module that respects the model equivariance and by adding a robust bundle adjustment step. Experiments demonstrate that our method can be successfully applied in realistic settings that include large image collections and point tracks extracted with common heuristics and include many outliers.

[133]  arXiv:2404.14309 [pdf, other]
Title: Towards Better Adversarial Purification via Adversarial Denoising Diffusion Training
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recently, diffusion-based purification (DBP) has emerged as a promising approach for defending against adversarial attacks. However, previous studies have used questionable methods to evaluate the robustness of DBP models, their explanations of DBP robustness also lack experimental support. We re-examine DBP robustness using precise gradient, and discuss the impact of stochasticity on DBP robustness. To better explain DBP robustness, we assess DBP robustness under a novel attack setting, Deterministic White-box, and pinpoint stochasticity as the main factor in DBP robustness. Our results suggest that DBP models rely on stochasticity to evade the most effective attack direction, rather than directly countering adversarial perturbations. To improve the robustness of DBP models, we propose Adversarial Denoising Diffusion Training (ADDT). This technique uses Classifier-Guided Perturbation Optimization (CGPO) to generate adversarial perturbation through guidance from a pre-trained classifier, and uses Rank-Based Gaussian Mapping (RBGM) to convert adversarial pertubation into a normal Gaussian distribution. Empirical results show that ADDT improves the robustness of DBP models. Further experiments confirm that ADDT equips DBP models with the ability to directly counter adversarial perturbations.

[134]  arXiv:2404.14329 [pdf, other]
Title: X-Ray: A Sequential 3D Representation for Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we introduce X-Ray, an innovative approach to 3D generation that employs a new sequential representation, drawing inspiration from the depth-revealing capabilities of X-Ray scans to meticulously capture both the external and internal features of objects. Central to our method is the utilization of ray casting techniques originating from the camera's viewpoint, meticulously recording the geometric and textural details encountered across all intersected surfaces. This process efficiently condenses complete objects or scenes into a multi-frame format, just like videos. Such a structure ensures the 3D representation is composed solely of critical surface information. Highlighting the practicality and adaptability of our X-Ray representation, we showcase its utility in synthesizing 3D objects, employing a network architecture akin to that used in video diffusion models. The outcomes reveal our representation's superior performance in enhancing both the accuracy and efficiency of 3D synthesis, heralding new directions for ongoing research and practical implementations in the field.

[135]  arXiv:2404.14343 [pdf, other]
Title: Heterogeneous Face Recognition Using Domain Invariant Units
Comments: 6 pages, Accepted ICASSP 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Heterogeneous Face Recognition (HFR) aims to expand the applicability of Face Recognition (FR) systems to challenging scenarios, enabling the matching of face images across different domains, such as matching thermal images to visible spectra. However, the development of HFR systems is challenging because of the significant domain gap between modalities and the lack of availability of large-scale paired multi-channel data. In this work, we leverage a pretrained face recognition model as a teacher network to learn domaininvariant network layers called Domain-Invariant Units (DIU) to reduce the domain gap. The proposed DIU can be trained effectively even with a limited amount of paired training data, in a contrastive distillation framework. This proposed approach has the potential to enhance pretrained models, making them more adaptable to a wider range of variations in data. We extensively evaluate our approach on multiple challenging benchmarks, demonstrating superior performance compared to state-of-the-art methods.

[136]  arXiv:2404.14344 [pdf, other]
Title: On-the-Fly Point Annotation for Fast Medical Video Labeling
Comments: 7 pages, 5 figures. Int J CARS (2024)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Purpose: In medical research, deep learning models rely on high-quality annotated data, a process often laborious and timeconsuming. This is particularly true for detection tasks where bounding box annotations are required. The need to adjust two corners makes the process inherently frame-by-frame. Given the scarcity of experts' time, efficient annotation methods suitable for clinicians are needed. Methods: We propose an on-the-fly method for live video annotation to enhance the annotation efficiency. In this approach, a continuous single-point annotation is maintained by keeping the cursor on the object in a live video, mitigating the need for tedious pausing and repetitive navigation inherent in traditional annotation methods. This novel annotation paradigm inherits the point annotation's ability to generate pseudo-labels using a point-to-box teacher model. We empirically evaluate this approach by developing a dataset and comparing on-the-fly annotation time against traditional annotation method. Results: Using our method, annotation speed was 3.2x faster than the traditional annotation technique. We achieved a mean improvement of 6.51 +- 0.98 AP@50 over conventional method at equivalent annotation budgets on the developed dataset. Conclusion: Without bells and whistles, our approach offers a significant speed-up in annotation tasks. It can be easily implemented on any annotation platform to accelerate the integration of deep learning in video-based medical research.

[137]  arXiv:2404.14349 [pdf, other]
Title: Automatic Discovery of Visual Circuits
Comments: 14 pages, 11 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

To date, most discoveries of network subcomponents that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies recognition of a specific visual concept. We introduce a new method for identifying these subgraphs: specifying a visual concept using a few examples, and then tracing the interdependence of neuron activations across layers, or their functional connectivity. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.

[138]  arXiv:2404.14351 [pdf, other]
Title: Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We address the task of estimating camera parameters from a set of images depicting a scene. Popular feature-based structure-from-motion (SfM) tools solve this task by incremental reconstruction: they repeat triangulation of sparse 3D points and registration of more camera views to the sparse point cloud. We re-interpret incremental structure-from-motion as an iterated application and refinement of a visual relocalizer, that is, of a method that registers new views to the current state of the reconstruction. This perspective allows us to investigate alternative visual relocalizers that are not rooted in local feature matching. We show that scene coordinate regression, a learning-based relocalization approach, allows us to build implicit, neural scene representations from unposed images. Different from other learning-based reconstruction methods, we do not require pose priors nor sequential inputs, and we optimize efficiently over thousands of images. Our method, ACE0 (ACE Zero), estimates camera poses to an accuracy comparable to feature-based SfM, as demonstrated by novel view synthesis. Project page: https://nianticlabs.github.io/acezero/

[139]  arXiv:2404.14368 [pdf, other]
Title: Graphic Design with Large Multimodal Model
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

In the field of graphic design, automating the integration of design elements into a cohesive multi-layered artwork not only boosts productivity but also paves the way for the democratization of graphic design. One existing practice is Graphic Layout Generation (GLG), which aims to layout sequential design elements. It has been constrained by the necessity for a predefined correct sequence of layers, thus limiting creative potential and increasing user workload. In this paper, we present Hierarchical Layout Generation (HLG) as a more flexible and pragmatic setup, which creates graphic composition from unordered sets of design elements. To tackle the HLG task, we introduce Graphist, the first layout generation model based on large multimodal models. Graphist efficiently reframes the HLG as a sequence generation problem, utilizing RGB-A images as input, outputs a JSON draft protocol, indicating the coordinates, size, and order of each element. We develop new evaluation metrics for HLG. Graphist outperforms prior arts and establishes a strong baseline for this field. Project homepage: https://github.com/graphic-design-ai/graphist

[140]  arXiv:2404.14381 [pdf, other]
Title: TAVGBench: Benchmarking Text to Audible-Video Generation
Comments: Technical Report. Project page:this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

The Text to Audible-Video Generation (TAVG) task involves generating videos with accompanying audio based on text descriptions. Achieving this requires skillful alignment of both audio and video elements. To support research in this field, we have developed a comprehensive Text to Audible-Video Generation Benchmark (TAVGBench), which contains over 1.7 million clips with a total duration of 11.8 thousand hours. We propose an automatic annotation pipeline to ensure each audible video has detailed descriptions for both its audio and video contents. We also introduce the Audio-Visual Harmoni score (AVHScore) to provide a quantitative measure of the alignment between the generated audio and video modalities. Additionally, we present a baseline model for TAVG called TAVDiffusion, which uses a two-stream latent diffusion model to provide a fundamental starting point for further research in this area. We achieve the alignment of audio and video by employing cross-attention and contrastive learning. Through extensive experiments and evaluations on TAVGBench, we demonstrate the effectiveness of our proposed model under both conventional metrics and our proposed metrics.

[141]  arXiv:2404.14396 [pdf, other]
Title: SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
Comments: Project released at: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The rapid evolution of multimodal foundation model has demonstrated significant progresses in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between its capability and the real-world applicability, primarily due to the model's limited capacity to effectively respond to various user instructions and interact with diverse visual data. In this work, we focus on bridging this gap through integrating two enhanced features: (1) comprehending images of arbitrary sizes and ratios, and (2) enabling multi-granularity image generation. We present a unified and versatile foundation model, namely, SEED-X, which is able to model multi-granularity visual semantics for comprehension and generation tasks. Besides the competitive results on public benchmarks, SEED-X demonstrates its effectiveness in handling real-world applications across various domains after instruction tuning. We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications. The models, codes, and datasets will be released in https://github.com/AILab-CVC/SEED-X.

[142]  arXiv:2404.14403 [pdf, other]
Title: GeoDiffuser: Geometry-Based Image Editing with Diffusion Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The success of image generative models has enabled us to build methods that can edit images based on text or other user input. However, these methods are bespoke, imprecise, require additional information, or are limited to only 2D image edits. We present GeoDiffuser, a zero-shot optimization-based method that unifies common 2D and 3D image-based object editing capabilities into a single method. Our key insight is to view image editing operations as geometric transformations. We show that these transformations can be directly incorporated into the attention layers in diffusion models to implicitly perform editing operations. Our training-free optimization method uses an objective function that seeks to preserve object style but generate plausible images, for instance with accurate lighting and shadows. It also inpaints disoccluded parts of the image where the object was originally located. Given a natural image and user input, we segment the foreground object using SAM and estimate a corresponding transform which is used by our optimization approach for editing. GeoDiffuser can perform common 2D and 3D edits like object translation, 3D rotation, and removal. We present quantitative results, including a perceptual study, that shows how our approach is better than existing methods. Visit https://ivl.cs.brown.edu/research/geodiffuser.html for more information.

[143]  arXiv:2404.14406 [pdf, other]
Title: Hyp-OC: Hyperbolic One Class Classification for Face Anti-Spoofing
Comments: Accepted in FG2024, Project Page - this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Face recognition technology has become an integral part of modern security systems and user authentication processes. However, these systems are vulnerable to spoofing attacks and can easily be circumvented. Most prior research in face anti-spoofing (FAS) approaches it as a two-class classification task where models are trained on real samples and known spoof attacks and tested for detection performance on unknown spoof attacks. However, in practice, FAS should be treated as a one-class classification task where, while training, one cannot assume any knowledge regarding the spoof samples a priori. In this paper, we reformulate the face anti-spoofing task from a one-class perspective and propose a novel hyperbolic one-class classification framework. To train our network, we use a pseudo-negative class sampled from the Gaussian distribution with a weighted running mean and propose two novel loss functions: (1) Hyp-PC: Hyperbolic Pairwise Confusion loss, and (2) Hyp-CE: Hyperbolic Cross Entropy loss, which operate in the hyperbolic space. Additionally, we employ Euclidean feature clipping and gradient clipping to stabilize the training in the hyperbolic space. To the best of our knowledge, this is the first work extending hyperbolic embeddings for face anti-spoofing in a one-class manner. With extensive experiments on five benchmark datasets: Rose-Youtu, MSU-MFSD, CASIA-MFSD, Idiap Replay-Attack, and OULU-NPU, we demonstrate that our method significantly outperforms the state-of-the-art, achieving better spoof detection performance.

[144]  arXiv:2404.14409 [pdf, other]
Title: CrossScore: Towards Multi-View Image Evaluation and Scoring
Comments: Project page see this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We introduce a novel cross-reference image quality assessment method that effectively fills the gap in the image assessment landscape, complementing the array of established evaluation schemes -- ranging from full-reference metrics like SSIM, no-reference metrics such as NIQE, to general-reference metrics including FID, and Multi-modal-reference metrics, e.g., CLIPScore. Utilising a neural network with the cross-attention mechanism and a unique data collection pipeline from NVS optimisation, our method enables accurate image quality assessment without requiring ground truth references. By comparing a query image against multiple views of the same scene, our method addresses the limitations of existing metrics in novel view synthesis (NVS) and similar tasks where direct reference images are unavailable. Experimental results show that our method is closely correlated to the full-reference metric SSIM, while not requiring ground truth references.

[145]  arXiv:2404.14410 [pdf, other]
Title: Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses
Comments: The project page is available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we present a method to reconstruct the world and multiple dynamic humans in 3D from a monocular video input. As a key idea, we represent both the world and multiple humans via the recently emerging 3D Gaussian Splatting (3D-GS) representation, enabling to conveniently and efficiently compose and render them together. In particular, we address the scenarios with severely limited and sparse observations in 3D human reconstruction, a common challenge encountered in the real world. To tackle this challenge, we introduce a novel approach to optimize the 3D-GS representation in a canonical space by fusing the sparse cues in the common space, where we leverage a pre-trained 2D diffusion model to synthesize unseen views while keeping the consistency with the observed 2D appearances. We demonstrate our method can reconstruct high-quality animatable 3D humans in various challenging examples, in the presence of occlusion, image crops, few-shot, and extremely sparse observations. After reconstruction, our method is capable of not only rendering the scene in any novel views at arbitrary time instances, but also editing the 3D scene by removing individual humans or applying different motions for each human. Through various experiments, we demonstrate the quality and efficiency of our methods over alternative existing approaches.

[146]  arXiv:2404.14412 [pdf, other]
Title: AutoAD III: The Prequel -- Back to the Pixels
Comments: CVPR2024. Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.

Cross-lists for Tue, 23 Apr 24

[147]  arXiv:2404.13097 (cross-list from eess.IV) [pdf, other]
Title: DISC: Latent Diffusion Models with Self-Distillation from Separated Conditions for Prostate Cancer Grading
Comments: Abstract accepted for ISBI 2024. Extended version to be presented at SynData4CV @ CVPR 2024. See more at this https URL
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Latent Diffusion Models (LDMs) can generate high-fidelity images from noise, offering a promising approach for augmenting histopathology images for training cancer grading models. While previous works successfully generated high-fidelity histopathology images using LDMs, the generation of image tiles to improve prostate cancer grading has not yet been explored. Additionally, LDMs face challenges in accurately generating admixtures of multiple cancer grades in a tile when conditioned by a tile mask. In this study, we train specific LDMs to generate synthetic tiles that contain multiple Gleason Grades (GGs) by leveraging pixel-wise annotations in input tiles. We introduce a novel framework named Self-Distillation from Separated Conditions (DISC) that generates GG patterns guided by GG masks. Finally, we deploy a training framework for pixel-level and slide-level prostate cancer grading, where synthetic tiles are effectively utilized to improve the cancer grading performance of existing models. As a result, this work surpasses previous works in two domains: 1) our LDMs enhanced with DISC produce more accurate tiles in terms of GG patterns, and 2) our training scheme, incorporating synthetic data, significantly improves the generalization of the baseline model for prostate cancer grading, particularly in challenging cases of rare GG5, demonstrating the potential of generative models to enhance cancer grading when data is limited.

[148]  arXiv:2404.13101 (cross-list from eess.IV) [pdf, ps, other]
Title: DensePANet: An improved generative adversarial network for photoacoustic tomography image reconstruction from sparse data
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Image reconstruction is an essential step of every medical imaging method, including Photoacoustic Tomography (PAT), which is a promising modality of imaging, that unites the benefits of both ultrasound and optical imaging methods. Reconstruction of PAT images using conventional methods results in rough artifacts, especially when applied directly to sparse PAT data. In recent years, generative adversarial networks (GANs) have shown a powerful performance in image generation as well as translation, rendering them a smart choice to be applied to reconstruction tasks. In this study, we proposed an end-to-end method called DensePANet to solve the problem of PAT image reconstruction from sparse data. The proposed model employs a novel modification of UNet in its generator, called FD-UNet++, which considerably improves the reconstruction performance. We evaluated the method on various in-vivo and simulated datasets. Quantitative and qualitative results show the better performance of our model over other prevalent deep learning techniques.

[149]  arXiv:2404.13102 (cross-list from eess.IV) [pdf, other]
Title: Single-sample image-fusion upsampling of fluorescence lifetime images
Comments: 18 pages, 11 figures. To be published in Science Advances
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Biological Physics (physics.bio-ph); Optics (physics.optics)

Fluorescence lifetime imaging microscopy (FLIM) provides detailed information about molecular interactions and biological processes. A major bottleneck for FLIM is image resolution at high acquisition speeds, due to the engineering and signal-processing limitations of time-resolved imaging technology. Here we present single-sample image-fusion upsampling (SiSIFUS), a data-fusion approach to computational FLIM super-resolution that combines measurements from a low-resolution time-resolved detector (that measures photon arrival time) and a high-resolution camera (that measures intensity only). To solve this otherwise ill-posed inverse retrieval problem, we introduce statistically informed priors that encode local and global dependencies between the two single-sample measurements. This bypasses the risk of out-of-distribution hallucination as in traditional data-driven approaches and delivers enhanced images compared for example to standard bilinear interpolation. The general approach laid out by SiSIFUS can be applied to other image super-resolution problems where two different datasets are available.

[150]  arXiv:2404.13103 (cross-list from eess.IV) [pdf, other]
Title: ToNNO: Tomographic Reconstruction of a Neural Network's Output for Weakly Supervised Segmentation of 3D Medical Images
Comments: Accepted at CVPR 2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Annotating lots of 3D medical images for training segmentation models is time-consuming. The goal of weakly supervised semantic segmentation is to train segmentation models without using any ground truth segmentation masks. Our work addresses the case where only image-level categorical labels, indicating the presence or absence of a particular region of interest (such as tumours or lesions), are available. Most existing methods rely on class activation mapping (CAM). We propose a novel approach, ToNNO, which is based on the Tomographic reconstruction of a Neural Network's Output. Our technique extracts stacks of slices with different angles from the input 3D volume, feeds these slices to a 2D encoder, and applies the inverse Radon transform in order to reconstruct a 3D heatmap of the encoder's predictions. This generic method allows to perform dense prediction tasks on 3D volumes using any 2D image encoder. We apply it to weakly supervised medical image segmentation by training the 2D encoder to output high values for slices containing the regions of interest. We test it on four large scale medical image datasets and outperform 2D CAM methods. We then extend ToNNO by combining tomographic reconstruction with CAM methods, proposing Averaged CAM and Tomographic CAM, which obtain even better results.

[151]  arXiv:2404.13105 (cross-list from cs.DB) [pdf, other]
Title: On-Demand Earth System Data Cubes
Comments: Accepted at IGARSS24
Subjects: Databases (cs.DB); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Advancements in Earth system science have seen a surge in diverse datasets. Earth System Data Cubes (ESDCs) have been introduced to efficiently handle this influx of high-dimensional data. ESDCs offer a structured, intuitive framework for data analysis, organising information within spatio-temporal grids. The structured nature of ESDCs unlocks significant opportunities for Artificial Intelligence (AI) applications. By providing well-organised data, ESDCs are ideally suited for a wide range of sophisticated AI-driven tasks. An automated framework for creating AI-focused ESDCs with minimal user input could significantly accelerate the generation of task-specific training data. Here we introduce cubo, an open-source Python tool designed for easy generation of AI-focused ESDCs. Utilising collections in SpatioTemporal Asset Catalogs (STAC) that are stored as Cloud Optimised GeoTIFFs (COGs), cubo efficiently creates ESDCs, requiring only central coordinates, spatial resolution, edge size, and time range.

[152]  arXiv:2404.13106 (cross-list from eess.IV) [pdf, other]
Title: Automatic Cranial Defect Reconstruction with Self-Supervised Deep Deformable Masked Autoencoders
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Thousands of people suffer from cranial injuries every year. They require personalized implants that need to be designed and manufactured before the reconstruction surgery. The manual design is expensive and time-consuming leading to searching for algorithms whose goal is to automatize the process. The problem can be formulated as volumetric shape completion and solved by deep neural networks dedicated to supervised image segmentation. However, such an approach requires annotating the ground-truth defects which is costly and time-consuming. Usually, the process is replaced with synthetic defect generation. However, even the synthetic ground-truth generation is time-consuming and limits the data heterogeneity, thus the deep models' generalizability. In our work, we propose an alternative and simple approach to use a self-supervised masked autoencoder to solve the problem. This approach by design increases the heterogeneity of the training set and can be seen as a form of data augmentation. We compare the proposed method with several state-of-the-art deep neural networks and show both the quantitative and qualitative improvement on the SkullBreak and SkullFix datasets. The proposed method can be used to efficiently reconstruct the cranial defects in real time.

[153]  arXiv:2404.13108 (cross-list from eess.IV) [pdf, other]
Title: RegWSI: Whole Slide Image Registration using Combined Deep Feature- and Intensity-Based Methods: Winner of the ACROBAT 2023 Challenge
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

The automatic registration of differently stained whole slide images (WSIs) is crucial for improving diagnosis and prognosis by fusing complementary information emerging from different visible structures. It is also useful to quickly transfer annotations between consecutive or restained slides, thus significantly reducing the annotation time and associated costs. Nevertheless, the slide preparation is different for each stain and the tissue undergoes complex and large deformations. Therefore, a robust, efficient, and accurate registration method is highly desired by the scientific community and hospitals specializing in digital pathology. We propose a two-step hybrid method consisting of (i) deep learning- and feature-based initial alignment algorithm, and (ii) intensity-based nonrigid registration using the instance optimization. The proposed method does not require any fine-tuning to a particular dataset and can be used directly for any desired tissue type and stain. The method scored 1st place in the ACROBAT 2023 challenge. We evaluated using three open datasets: (i) ANHIR, (ii) ACROBAT, and (iii) HyReCo, and performed several ablation studies concerning the resolution used for registration and the initial alignment robustness and stability. The method achieves the most accurate results for the ACROBAT dataset, the cell-level registration accuracy for the restained slides from the HyReCo dataset, and is among the best methods evaluated on the ANHIR dataset. The method does not require any fine-tuning to a new datasets and can be used out-of-the-box for other types of microscopic images. The method is incorporated into the DeeperHistReg framework, allowing others to directly use it to register, transform, and save the WSIs at any desired pyramid level. The proposed method is a significant contribution to the WSI registration, thus advancing the field of digital pathology.

[154]  arXiv:2404.13134 (cross-list from cs.MM) [pdf, other]
Title: Deep Learning-based Text-in-Image Watermarking
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

In this work, we introduce a novel deep learning-based approach to text-in-image watermarking, a method that embeds and extracts textual information within images to enhance data security and integrity. Leveraging the capabilities of deep learning, specifically through the use of Transformer-based architectures for text processing and Vision Transformers for image feature extraction, our method sets new benchmarks in the domain. The proposed method represents the first application of deep learning in text-in-image watermarking that improves adaptivity, allowing the model to intelligently adjust to specific image characteristics and emerging threats. Through testing and evaluation, our method has demonstrated superior robustness compared to traditional watermarking techniques, achieving enhanced imperceptibility that ensures the watermark remains undetectable across various image contents.

[155]  arXiv:2404.13146 (cross-list from cs.CR) [pdf, other]
Title: DeepFake-O-Meter v2.0: An Open Platform for DeepFake Detection
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Deepfakes, as AI-generated media, have increasingly threatened media integrity and personal privacy with realistic yet fake digital content. In this work, we introduce an open-source and user-friendly online platform, DeepFake-O-Meter v2.0, that integrates state-of-the-art methods for detecting Deepfake images, videos, and audio. Built upon DeepFake-O-Meter v1.0, we have made significant upgrades and improvements in platform architecture design, including user interaction, detector integration, job balancing, and security management. The platform aims to offer everyday users a convenient service for analyzing DeepFake media using multiple state-of-the-art detection algorithms. It ensures secure and private delivery of the analysis results. Furthermore, it serves as an evaluation and benchmarking platform for researchers in digital media forensics to compare the performance of multiple algorithms on the same input. We have also conducted detailed usage analysis based on the collected data to gain deeper insights into our platform's statistics. This involves analyzing two-month trends in user activity and evaluating the processing efficiency of each detector.

[156]  arXiv:2404.13153 (cross-list from eess.IV) [pdf, other]
Title: Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring
Comments: CVPR 2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Eliminating image blur produced by various kinds of motion has been a challenging problem. Dominant approaches rely heavily on model capacity to remove blurring by reconstructing residual from blurry observation in feature space. These practices not only prevent the capture of spatially variable motion in the real world but also ignore the tailored handling of various motions in image space. In this paper, we propose a novel real-world deblurring filtering model called the Motion-adaptive Separable Collaborative (MISC) Filter. In particular, we use a motion estimation network to capture motion information from neighborhoods, thereby adaptively estimating spatially-variant motion flow, mask, kernels, weights, and offsets to obtain the MISC Filter. The MISC Filter first aligns the motion-induced blurring patterns to the motion middle along the predicted flow direction, and then collaboratively filters the aligned image through the predicted kernels, weights, and offsets to generate the output. This design can handle more generalized and complex motion in a spatially differentiated manner. Furthermore, we analyze the relationships between the motion estimation network and the residual reconstruction network. Extensive experiments on four widely used benchmarks demonstrate that our method provides an effective solution for real-world motion blur removal and achieves state-of-the-art performance. Code is available at https://github.com/ChengxuLiu/MISCFilter

[157]  arXiv:2404.13185 (cross-list from eess.IV) [pdf, other]
Title: Unlocking Robust Segmentation Across All Age Groups via Continual Learning
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Most deep learning models in medical imaging are trained on adult data with unclear performance on pediatric images. In this work, we aim to address this challenge in the context of automated anatomy segmentation in whole-body Computed Tomography (CT). We evaluate the performance of CT organ segmentation algorithms trained on adult data when applied to pediatric CT volumes and identify substantial age-dependent underperformance. We subsequently propose and evaluate strategies, including data augmentation and continual learning approaches, to achieve good segmentation accuracy across all age groups. Our best-performing model, trained using continual learning, achieves high segmentation accuracy on both adult and pediatric data (Dice scores of 0.90 and 0.84 respectively).

[158]  arXiv:2404.13194 (cross-list from cs.LG) [pdf, other]
Title: Privacy-Preserving Debiasing using Data Augmentation and Machine Unlearning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Data augmentation is widely used to mitigate data bias in the training dataset. However, data augmentation exposes machine learning models to privacy attacks, such as membership inference attacks. In this paper, we propose an effective combination of data augmentation and machine unlearning, which can reduce data bias while providing a provable defense against known attacks. Specifically, we maintain the fairness of the trained model with diffusion-based data augmentation, and then utilize multi-shard unlearning to remove identifying information of original data from the ML model for protection against privacy attacks. Experimental evaluation across diverse datasets demonstrates that our approach can achieve significant improvements in bias reduction as well as robustness against state-of-the-art privacy attacks.

[159]  arXiv:2404.13222 (cross-list from eess.IV) [pdf, other]
Title: Vim4Path: Self-Supervised Vision Mamba for Histopathology Images
Comments: Accepted in CVPR2023 (9th Workshop on Computer Vision for Microscopy Image Analysis)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Representation learning from Gigapixel Whole Slide Images (WSI) poses a significant challenge in computational pathology due to the complicated nature of tissue structures and the scarcity of labeled data. Multi-instance learning methods have addressed this challenge, leveraging image patches to classify slides utilizing pretrained models using Self-Supervised Learning (SSL) approaches. The performance of both SSL and MIL methods relies on the architecture of the feature encoder. This paper proposes leveraging the Vision Mamba (Vim) architecture, inspired by state space models, within the DINO framework for representation learning in computational pathology. We evaluate the performance of Vim against Vision Transformers (ViT) on the Camelyon16 dataset for both patch-level and slide-level classification. Our findings highlight Vim's enhanced performance compared to ViT, particularly at smaller scales, where Vim achieves an 8.21 increase in ROC AUC for models of similar size. An explainability analysis further highlights Vim's capabilities, which reveals that Vim uniquely emulates the pathologist workflow-unlike ViT. This alignment with human expert analysis highlights Vim's potential in practical diagnostic settings and contributes significantly to developing effective representation-learning algorithms in computational pathology. We release the codes and pretrained weights at \url{https://github.com/AtlasAnalyticsLab/Vim4Path}.

[160]  arXiv:2404.13277 (cross-list from eess.IV) [pdf, other]
Title: Beyond Score Changes: Adversarial Attack on No-Reference Image Quality Assessment from Two Perspectives
Comments: Submitted to a conference
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Deep neural networks have demonstrated impressive success in No-Reference Image Quality Assessment (NR-IQA). However, recent researches highlight the vulnerability of NR-IQA models to subtle adversarial perturbations, leading to inconsistencies between model predictions and subjective ratings. Current adversarial attacks, however, focus on perturbing predicted scores of individual images, neglecting the crucial aspect of inter-score correlation relationships within an entire image set. Meanwhile, it is important to note that the correlation, like ranking correlation, plays a significant role in NR-IQA tasks. To comprehensively explore the robustness of NR-IQA models, we introduce a new framework of correlation-error-based attacks that perturb both the correlation within an image set and score changes on individual images. Our research primarily focuses on ranking-related correlation metrics like Spearman's Rank-Order Correlation Coefficient (SROCC) and prediction error-related metrics like Mean Squared Error (MSE). As an instantiation, we propose a practical two-stage SROCC-MSE-Attack (SMA) that initially optimizes target attack scores for the entire image set and then generates adversarial examples guided by these scores. Experimental results demonstrate that our SMA method not only significantly disrupts the SROCC to negative values but also maintains a considerable change in the scores of individual images. Meanwhile, it exhibits state-of-the-art performance across metrics with different categories. Our method provides a new perspective on the robustness of NR-IQA models.

[161]  arXiv:2404.13288 (cross-list from cs.RO) [pdf, other]
Title: PoseINN: Realtime Visual-based Pose Regression and Localization with Invertible Neural Networks
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Estimating ego-pose from cameras is an important problem in robotics with applications ranging from mobile robotics to augmented reality. While SOTA models are becoming increasingly accurate, they can still be unwieldy due to high computational costs. In this paper, we propose to solve the problem by using invertible neural networks (INN) to find the mapping between the latent space of images and poses for a given scene. Our model achieves similar performance to the SOTA while being faster to train and only requiring offline rendering of low-resolution synthetic data. By using normalizing flows, the proposed method also provides uncertainty estimation for the output. We also demonstrated the efficiency of this method by deploying the model on a mobile robot.

[162]  arXiv:2404.13330 (cross-list from eess.IV) [pdf, other]
Title: SEGSRNet for Stereo-Endoscopic Image Super-Resolution and Surgical Instrument Segmentation
Comments: Paper accepted for Presentation in 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS), Orlando, Florida, USA
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

SEGSRNet addresses the challenge of precisely identifying surgical instruments in low-resolution stereo endoscopic images, a common issue in medical imaging and robotic surgery. Our innovative framework enhances image clarity and segmentation accuracy by applying state-of-the-art super-resolution techniques before segmentation. This ensures higher-quality inputs for more precise segmentation. SEGSRNet combines advanced feature extraction and attention mechanisms with spatial processing to sharpen image details, which is significant for accurate tool identification in medical images. Our proposed model outperforms current models including Dice, IoU, PSNR, and SSIM, SEGSRNet where it produces clearer and more accurate images for stereo endoscopic surgical imaging. SEGSRNet can provide image resolution and precise segmentation which can significantly enhance surgical accuracy and patient care outcomes.

[163]  arXiv:2404.13372 (cross-list from eess.IV) [pdf, other]
Title: HybridFlow: Infusing Continuity into Masked Codebook for Extreme Low-Bitrate Image Compression
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

This paper investigates the challenging problem of learned image compression (LIC) with extreme low bitrates. Previous LIC methods based on transmitting quantized continuous features often yield blurry and noisy reconstruction due to the severe quantization loss. While previous LIC methods based on learned codebooks that discretize visual space usually give poor-fidelity reconstruction due to the insufficient representation power of limited codewords in capturing faithful details. We propose a novel dual-stream framework, HyrbidFlow, which combines the continuous-feature-based and codebook-based streams to achieve both high perceptual quality and high fidelity under extreme low bitrates. The codebook-based stream benefits from the high-quality learned codebook priors to provide high quality and clarity in reconstructed images. The continuous feature stream targets at maintaining fidelity details. To achieve the ultra low bitrate, a masked token-based transformer is further proposed, where we only transmit a masked portion of codeword indices and recover the missing indices through token generation guided by information from the continuous feature stream. We also develop a bridging correction network to merge the two streams in pixel decoding for final image reconstruction, where the continuous stream features rectify biases of the codebook-based pixel decoder to impose reconstructed fidelity details. Experimental results demonstrate superior performance across several datasets under extremely low bitrates, compared with existing single-stream codebook-based or continuous-feature-based LIC methods.

[164]  arXiv:2404.13386 (cross-list from eess.IV) [pdf, ps, other]
Title: SSVT: Self-Supervised Vision Transformer For Eye Disease Diagnosis Based On Fundus Images
Comments: ISBI 2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Machine learning-based fundus image diagnosis technologies trigger worldwide interest owing to their benefits such as reducing medical resource power and providing objective evaluation results. However, current methods are commonly based on supervised methods, bringing in a heavy workload to biomedical staff and hence suffering in expanding effective databases. To address this issue, in this article, we established a label-free method, name 'SSVT',which can automatically analyze un-labeled fundus images and generate high evaluation accuracy of 97.0% of four main eye diseases based on six public datasets and two datasets collected by Beijing Tongren Hospital. The promising results showcased the effectiveness of the proposed unsupervised learning method, and the strong application potential in biomedical resource shortage regions to improve global eye health.

[165]  arXiv:2404.13388 (cross-list from eess.IV) [pdf, ps, other]
Title: Diagnosis of Multiple Fundus Disorders Amidst a Scarcity of Medical Experts Via Self-supervised Machine Learning
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Fundus diseases are major causes of visual impairment and blindness worldwide, especially in underdeveloped regions, where the shortage of ophthalmologists hinders timely diagnosis. AI-assisted fundus image analysis has several advantages, such as high accuracy, reduced workload, and improved accessibility, but it requires a large amount of expert-annotated data to build reliable models. To address this dilemma, we propose a general self-supervised machine learning framework that can handle diverse fundus diseases from unlabeled fundus images. Our method's AUC surpasses existing supervised approaches by 15.7%, and even exceeds performance of a single human expert. Furthermore, our model adapts well to various datasets from different regions, races, and heterogeneous image sources or qualities from multiple cameras or devices. Our method offers a label-free general framework to diagnose fundus diseases, which could potentially benefit telehealth programs for early screening of people at risk of vision loss.

[166]  arXiv:2404.13452 (cross-list from eess.IV) [pdf, other]
Title: Cut-FUNQUE: An Objective Quality Model for Compressed Tone-Mapped High Dynamic Range Videos
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

High Dynamic Range (HDR) videos have enjoyed a surge in popularity in recent years due to their ability to represent a wider range of contrast and color than Standard Dynamic Range (SDR) videos. Although HDR video capture has seen increasing popularity because of recent flagship mobile phones such as Apple iPhones, Google Pixels, and Samsung Galaxy phones, a broad swath of consumers still utilize legacy SDR displays that are unable to display HDR videos. As result, HDR videos must be processed, i.e., tone-mapped, before streaming to a large section of SDR-capable video consumers. However, server-side tone-mapping involves automating decisions regarding the choices of tone-mapping operators (TMOs) and their parameters to yield high-fidelity outputs. Moreover, these choices must be balanced against the effects of lossy compression, which is ubiquitous in streaming scenarios. In this work, we develop a novel, efficient model of objective video quality named Cut-FUNQUE that is able to accurately predict the visual quality of tone-mapped and compressed HDR videos. Finally, we evaluate Cut-FUNQUE on a large-scale crowdsourced database of such videos and show that it achieves state-of-the-art accuracy.

[167]  arXiv:2404.13474 (cross-list from cs.RO) [pdf, other]
Title: Composing Pre-Trained Object-Centric Representations for Robotics From "What" and "Where" Foundation Models
Comments: ICRA 2024. Project website: this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

There have recently been large advances both in pre-training visual representations for robotic control and segmenting unknown category objects in general images. To leverage these for improved robot learning, we propose $\textbf{POCR}$, a new framework for building pre-trained object-centric representations for robotic control. Building on theories of "what-where" representations in psychology and computer vision, we use segmentations from a pre-trained model to stably locate across timesteps, various entities in the scene, capturing "where" information. To each such segmented entity, we apply other pre-trained models that build vector descriptions suitable for robotic control tasks, thus capturing "what" the entity is. Thus, our pre-trained object-centric representations for control are constructed by appropriately combining the outputs of off-the-shelf pre-trained models, with no new training. On various simulated and real robotic tasks, we show that imitation policies for robotic manipulators trained on POCR achieve better performance and systematic generalization than state of the art pre-trained representations for robotics, as well as prior object-centric representations that are typically trained from scratch.

[168]  arXiv:2404.13478 (cross-list from cs.RO) [pdf, other]
Title: Deep SE(3)-Equivariant Geometric Reasoning for Precise Placement Tasks
Comments: Published at International Conference on Representation Learning (ICLR 2024)
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Many robot manipulation tasks can be framed as geometric reasoning tasks, where an agent must be able to precisely manipulate an object into a position that satisfies the task from a set of initial conditions. Often, task success is defined based on the relationship between two objects - for instance, hanging a mug on a rack. In such cases, the solution should be equivariant to the initial position of the objects as well as the agent, and invariant to the pose of the camera. This poses a challenge for learning systems which attempt to solve this task by learning directly from high-dimensional demonstrations: the agent must learn to be both equivariant as well as precise, which can be challenging without any inductive biases about the problem. In this work, we propose a method for precise relative pose prediction which is provably SE(3)-equivariant, can be learned from only a few demonstrations, and can generalize across variations in a class of objects. We accomplish this by factoring the problem into learning an SE(3) invariant task-specific representation of the scene and then interpreting this representation with novel geometric reasoning layers which are provably SE(3) equivariant. We demonstrate that our method can yield substantially more precise placement predictions in simulated placement tasks than previous methods trained with the same amount of data, and can accurately represent relative placement relationships data collected from real-world demonstrations. Supplementary information and videos can be found at https://sites.google.com/view/reldist-iclr-2023.

[169]  arXiv:2404.13484 (cross-list from eess.IV) [pdf, other]
Title: Joint Quality Assessment and Example-Guided Image Processing by Disentangling Picture Appearance from Content
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

The deep learning revolution has strongly impacted low-level image processing tasks such as style/domain transfer, enhancement/restoration, and visual quality assessments. Despite often being treated separately, the aforementioned tasks share a common theme of understanding, editing, or enhancing the appearance of input images without modifying the underlying content. We leverage this observation to develop a novel disentangled representation learning method that decomposes inputs into content and appearance features. The model is trained in a self-supervised manner and we use the learned features to develop a new quality prediction model named DisQUE. We demonstrate through extensive evaluations that DisQUE achieves state-of-the-art accuracy across quality prediction tasks and distortion types. Moreover, we demonstrate that the same features may also be used for image processing tasks such as HDR tone mapping, where the desired output characteristics may be tuned using example input-output pairs.

[170]  arXiv:2404.13521 (cross-list from cs.HC) [pdf, other]
Title: Graph4GUI: Graph Neural Networks for Representing Graphical User Interfaces
Comments: 18 pages
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Present-day graphical user interfaces (GUIs) exhibit diverse arrangements of text, graphics, and interactive elements such as buttons and menus, but representations of GUIs have not kept up. They do not encapsulate both semantic and visuo-spatial relationships among elements. To seize machine learning's potential for GUIs more efficiently, Graph4GUI exploits graph neural networks to capture individual elements' properties and their semantic-visuo-spatial constraints in a layout. The learned representation demonstrated its effectiveness in multiple tasks, especially generating designs in a challenging GUI autocompletion task, which involved predicting the positions of remaining unplaced elements in a partially completed GUI. The new model's suggestions showed alignment and visual appeal superior to the baseline method and received higher subjective ratings for preference. Furthermore, we demonstrate the practical benefits and efficiency advantages designers perceive when utilizing our model as an autocompletion plug-in.

[171]  arXiv:2404.13537 (cross-list from eess.IV) [pdf, other]
Title: Bracketing Image Restoration and Enhancement with High-Low Frequency Decomposition
Comments: This paper is accepted by CVPR 2024 Workshop
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

In real-world scenarios, due to a series of image degradations, obtaining high-quality, clear content photos is challenging. While significant progress has been made in synthesizing high-quality images, previous methods for image restoration and enhancement often overlooked the characteristics of different degradations. They applied the same structure to address various types of degradation, resulting in less-than-ideal restoration outcomes. Inspired by the notion that high/low frequency information is applicable to different degradations, we introduce HLNet, a Bracketing Image Restoration and Enhancement method based on high-low frequency decomposition. Specifically, we employ two modules for feature extraction: shared weight modules and non-shared weight modules. In the shared weight modules, we use SCConv to extract common features from different degradations. In the non-shared weight modules, we introduce the High-Low Frequency Decomposition Block (HLFDB), which employs different methods to handle high-low frequency information, enabling the model to address different degradations more effectively. Compared to other networks, our method takes into account the characteristics of different degradations, thus achieving higher-quality image restoration.

[172]  arXiv:2404.13640 (cross-list from cs.MM) [pdf, other]
Title: Beyond Alignment: Blind Video Face Restoration via Parsing-Guided Temporal-Coherent Transformer
Comments: 9 pages
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Multiple complex degradations are coupled in low-quality video faces in the real world. Therefore, blind video face restoration is a highly challenging ill-posed problem, requiring not only hallucinating high-fidelity details but also enhancing temporal coherence across diverse pose variations. Restoring each frame independently in a naive manner inevitably introduces temporal incoherence and artifacts from pose changes and keypoint localization errors. To address this, we propose the first blind video face restoration approach with a novel parsing-guided temporal-coherent transformer (PGTFormer) without pre-alignment. PGTFormer leverages semantic parsing guidance to select optimal face priors for generating temporally coherent artifact-free results. Specifically, we pre-train a temporal-spatial vector quantized auto-encoder on high-quality video face datasets to extract expressive context-rich priors. Then, the temporal parse-guided codebook predictor (TPCP) restores faces in different poses based on face parsing context cues without performing face pre-alignment. This strategy reduces artifacts and mitigates jitter caused by cumulative errors from face pre-alignment. Finally, the temporal fidelity regulator (TFR) enhances fidelity through temporal feature interaction and improves video temporal consistency. Extensive experiments on face videos show that our method outperforms previous face restoration baselines. The code will be released on \href{https://github.com/kepengxu/PGTFormer}{https://github.com/kepengxu/PGTFormer}.

[173]  arXiv:2404.13693 (cross-list from eess.IV) [pdf, other]
Title: PV-S3: Advancing Automatic Photovoltaic Defect Detection using Semi-Supervised Semantic Segmentation of Electroluminescence Images
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Photovoltaic (PV) systems allow us to tap into all abundant solar energy, however they require regular maintenance for high efficiency and to prevent degradation. Traditional manual health check, using Electroluminescence (EL) imaging, is expensive and logistically challenging making automated defect detection essential. Current automation approaches require extensive manual expert labeling, which is time-consuming, expensive, and prone to errors. We propose PV-S3 (Photovoltaic-Semi Supervised Segmentation), a Semi-Supervised Learning approach for semantic segmentation of defects in EL images that reduces reliance on extensive labeling. PV-S3 is a Deep learning model trained using a few labeled images along with numerous unlabeled images. We introduce a novel Semi Cross-Entropy loss function to train PV-S3 which addresses the challenges specific to automated PV defect detection, such as diverse defect types and class imbalance. We evaluate PV-S3 on multiple datasets and demonstrate its effectiveness and adaptability. With merely 20% labeled samples, we achieve an absolute improvement of 9.7% in IoU, 29.9% in Precision, 12.75% in Recall, and 20.42% in F1-Score over prior state-of-the-art supervised method (which uses 100% labeled samples) on UCF-EL dataset (largest dataset available for semantic segmentation of EL images) showing improvement in performance while reducing the annotation costs by 80%.

[174]  arXiv:2404.13704 (cross-list from eess.IV) [pdf, other]
Title: PEMMA: Parameter-Efficient Multi-Modal Adaptation for Medical Image Segmentation
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Imaging modalities such as Computed Tomography (CT) and Positron Emission Tomography (PET) are key in cancer detection, inspiring Deep Neural Networks (DNN) models that merge these scans for tumor segmentation. When both CT and PET scans are available, it is common to combine them as two channels of the input to the segmentation model. However, this method requires both scan types during training and inference, posing a challenge due to the limited availability of PET scans, thereby sometimes limiting the process to CT scans only. Hence, there is a need to develop a flexible DNN architecture that can be trained/updated using only CT scans but can effectively utilize PET scans when they become available. In this work, we propose a parameter-efficient multi-modal adaptation (PEMMA) framework for lightweight upgrading of a transformer-based segmentation model trained only on CT scans to also incorporate PET scans. The benefits of the proposed approach are two-fold. Firstly, we leverage the inherent modularity of the transformer architecture and perform low-rank adaptation (LoRA) of the attention weights to achieve parameter-efficient adaptation. Secondly, since the PEMMA framework attempts to minimize cross modal entanglement, it is possible to subsequently update the combined model using only one modality, without causing catastrophic forgetting of the other modality. Our proposed method achieves comparable results with the performance of early fusion techniques with just 8% of the trainable parameters, especially with a remarkable +28% improvement on the average dice score on PET scans when trained on a single modality.

[175]  arXiv:2404.13733 (cross-list from cs.LG) [pdf, other]
Title: Elucidating the Design Space of Dataset Condensation
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Dataset condensation, a concept within data-centric learning, efficiently transfers critical attributes from an original dataset to a synthetic version, maintaining both diversity and realism. This approach significantly improves model training efficiency and is adaptable across multiple application areas. Previous methods in dataset condensation have faced challenges: some incur high computational costs which limit scalability to larger datasets (e.g., MTT, DREAM, and TESLA), while others are restricted to less optimal design spaces, which could hinder potential improvements, especially in smaller datasets (e.g., SRe2L, G-VBSM, and RDED). To address these limitations, we propose a comprehensive design framework that includes specific, effective strategies like implementing soft category-aware matching and adjusting the learning rate schedule. These strategies are grounded in empirical evidence and theoretical backing. Our resulting approach, Elucidate Dataset Condensation (EDC), establishes a benchmark for both small and large-scale dataset condensation. In our testing, EDC achieves state-of-the-art accuracy, reaching 48.6% on ImageNet-1k with a ResNet-18 model at an IPC of 10, which corresponds to a compression ratio of 0.78%. This performance exceeds those of SRe2L, G-VBSM, and RDED by margins of 27.3%, 17.2%, and 6.6%, respectively.

[176]  arXiv:2404.13756 (cross-list from eess.IV) [pdf, other]
Title: BC-MRI-SEG: A Breast Cancer MRI Tumor Segmentation Benchmark
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Binary breast cancer tumor segmentation with Magnetic Resonance Imaging (MRI) data is typically trained and evaluated on private medical data, which makes comparing deep learning approaches difficult. We propose a benchmark (BC-MRI-SEG) for binary breast cancer tumor segmentation based on publicly available MRI datasets. The benchmark consists of four datasets in total, where two datasets are used for supervised training and evaluation, and two are used for zero-shot evaluation. Additionally we compare state-of-the-art (SOTA) approaches on our benchmark and provide an exhaustive list of available public breast cancer MRI datasets. The source code has been made available at https://irulenot.github.io/BC_MRI_SEG_Benchmark.

[177]  arXiv:2404.13767 (cross-list from cs.RO) [pdf, other]
Title: Autonomous Robot for Disaster Mapping and Victim Localization
Comments: Class final project for Northeastern University EECE 5550 Mobile Robotics Course
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

In response to the critical need for effective reconnaissance in disaster scenarios, this research article presents the design and implementation of a complete autonomous robot system using the Turtlebot3 with Robotic Operating System (ROS) Noetic. Upon deployment in closed, initially unknown environments, the system aims to generate a comprehensive map and identify any present 'victims' using AprilTags as stand-ins. We discuss our solution for search and rescue missions, while additionally exploring more advanced algorithms to improve search and rescue functionalities. We introduce a Cubature Kalman Filter to help reduce the mean squared error [m] for AprilTag localization and an information-theoretic exploration algorithm to expedite exploration in unknown environments. Just like turtles, our system takes it slow and steady, but when it's time to save the day, it moves at ninja-like speed! Despite Donatello's shell, he's no slowpoke - he zips through obstacles with the agility of a teenage mutant ninja turtle. So, hang on tight to your shells and get ready for a whirlwind of reconnaissance!
Full pipeline code https://github.com/rzhao5659/MRProject/tree/main
Exploration code https://github.com/rzhao5659/MRProject/tree/main

[178]  arXiv:2404.13784 (cross-list from cs.CR) [pdf, other]
Title: Iteratively Prompting Multimodal LLMs to Reproduce Natural and AI-Generated Images
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

With the digital imagery landscape rapidly evolving, image stocks and AI-generated image marketplaces have become central to visual media. Traditional stock images now exist alongside innovative platforms that trade in prompts for AI-generated visuals, driven by sophisticated APIs like DALL-E 3 and Midjourney. This paper studies the possibility of employing multi-modal models with enhanced visual understanding to mimic the outputs of these platforms, introducing an original attack strategy. Our method leverages fine-tuned CLIP models, a multi-label classifier, and the descriptive capabilities of GPT-4V to create prompts that generate images similar to those available in marketplaces and from premium stock image providers, yet at a markedly lower expense. In presenting this strategy, we aim to spotlight a new class of economic and security considerations within the realm of digital imagery. Our findings, supported by both automated metrics and human assessment, reveal that comparable visual content can be produced for a fraction of the prevailing market prices ($0.23 - $0.27 per image), emphasizing the need for awareness and strategic discussions about the integrity of digital media in an increasingly AI-integrated landscape. Our work also contributes to the field by assembling a dataset consisting of approximately 19 million prompt-image pairs generated by the popular Midjourney platform, which we plan to release publicly.

[179]  arXiv:2404.13874 (cross-list from cs.CL) [pdf, other]
Title: VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models
Comments: Work in process
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs, undermining their reliability. A comprehensive quantitative evaluation is necessary to identify and understand the extent of hallucinations in these models. However, existing benchmarks are often limited in scope, focusing mainly on object hallucinations. Furthermore, current evaluation methods struggle to effectively address the subtle semantic distinctions between model outputs and reference data, as well as the balance between hallucination and informativeness. To address these issues, we introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases. Moreover, we propose an large language model (LLM)-based two-stage evaluation framework that generalizes the popular CHAIR metric and incorporates both faithfulness and coverage into the evaluation. Experiments on 10 established LVLMs demonstrate that our evaluation metric is more comprehensive and better correlated with humans than existing work when evaluating on our challenging human annotated benchmark dataset. Our work also highlights the critical balance between faithfulness and coverage of model outputs, and encourages future works to address hallucinations in LVLMs while keeping their outputs informative.

[180]  arXiv:2404.13884 (cross-list from eess.IV) [pdf, ps, other]
Title: MambaUIE&SR: Unraveling the Ocean's Secrets with Only 2.8 FLOPs
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Underwater Image Enhancement (UIE) techniques aim to address the problem of underwater image degradation due to light absorption and scattering. In recent years, both Convolution Neural Network (CNN)-based and Transformer-based methods have been widely explored. In addition, combining CNN and Transformer can effectively combine global and local information for enhancement. However, this approach is still affected by the secondary complexity of the Transformer and cannot maximize the performance. Recently, the state-space model (SSM) based architecture Mamba has been proposed, which excels in modeling long distances while maintaining linear complexity. This paper explores the potential of this SSM-based model for UIE from both efficiency and effectiveness perspectives. However, the performance of directly applying Mamba is poor because local fine-grained features, which are crucial for image enhancement, cannot be fully utilized. Specifically, we customize the MambaUIE architecture for efficient UIE. Specifically, we introduce visual state space (VSS) blocks to capture global contextual information at the macro level while mining local information at the micro level. Also, for these two kinds of information, we propose a Dynamic Interaction Block (DIB) and Spatial feed-forward Network (SGFN) for intra-block feature aggregation. MambaUIE is able to efficiently synthesize global and local information and maintains a very small number of parameters with high accuracy. Experiments on UIEB datasets show that our method reduces GFLOPs by 67.4% (2.715G) relative to the SOTA method. To the best of our knowledge, this is the first UIE model constructed based on SSM that breaks the limitation of FLOPs on accuracy in UIE. The official repository of MambaUIE at https://github.com/1024AILab/MambaUIE.

[181]  arXiv:2404.13904 (cross-list from cs.LG) [pdf, other]
Title: Deep Regression Representation Learning with Topology
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Most works studying representation learning focus only on classification and neglect regression. Yet, the learning objectives and therefore the representation topologies of the two tasks are fundamentally different: classification targets class separation, leading to disconnected representations, whereas regression requires ordinality with respect to the target, leading to continuous representations. We thus wonder how the effectiveness of a regression representation is influenced by its topology, with evaluation based on the Information Bottleneck (IB) principle.
The IB principle is an important framework that provides principles for learning effectiveness representations. We establish two connections between it and the topology of regression representations. The first connection reveals that a lower intrinsic dimension of the feature space implies a reduced complexity of the representation Z. This complexity can be quantified as the conditional entropy of Z on the target space Y and serves as an upper bound on the generalization error. The second connection suggests learning a feature space that is topologically similar to the target space will better align with the IB principle. Based on these two connections, we introduce PH-Reg, a regularizer specific to regression that matches the intrinsic dimension and topology of the feature space with the target space. Experiments on synthetic and real-world regression tasks demonstrate the benefits of PH-Reg.

[182]  arXiv:2404.13929 (cross-list from eess.IV) [pdf, other]
Title: Exploring Kinetic Curves Features for the Classification of Benign and Malignant Breast Lesions in DCE-MRI
Comments: 6 pages, 8 figures, conference
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Breast cancer is the most common malignant tumor among women and the second cause of cancer-related death. Early diagnosis in clinical practice is crucial for timely treatment and prognosis. Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) has revealed great usability in the preoperative diagnosis and assessing therapy effects thanks to its capability to reflect the morphology and dynamic characteristics of breast lesions. However, most existing computer-assisted diagnosis algorithms only consider conventional radiomic features when classifying benign and malignant lesions in DCE-MRI. In this study, we propose to fully leverage the dynamic characteristics from the kinetic curves as well as the radiomic features to boost the classification accuracy of benign and malignant breast lesions. The proposed method is a fully automated solution by directly analyzing the 3D features from the DCE-MRI. The proposed method is evaluated on an in-house dataset including 200 DCE-MRI scans with 298 breast tumors (172 benign and 126 malignant tumors), achieving favorable classification accuracy with an area under curve (AUC) of 0.94. By simultaneously considering the dynamic and radiomic features, it is beneficial to effectively distinguish between benign and malignant breast lesions.

[183]  arXiv:2404.13993 (cross-list from cs.MM) [pdf, other]
Title: Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)

Recognizing characters and predicting speakers of dialogue are critical for comic processing tasks, such as voice generation or translation. However, because characters vary by comic title, supervised learning approaches like training character classifiers which require specific annotations for each comic title are infeasible. This motivates us to propose a novel zero-shot approach, allowing machines to identify characters and predict speaker names based solely on unannotated comic images. In spite of their importance in real-world applications, these task have largely remained unexplored due to challenges in story comprehension and multimodal integration. Recent large language models (LLMs) have shown great capability for text understanding and reasoning, while their application to multimodal content analysis is still an open problem. To address this problem, we propose an iterative multimodal framework, the first to employ multimodal information for both character identification and speaker prediction tasks. Our experiments demonstrate the effectiveness of the proposed framework, establishing a robust baseline for these tasks. Furthermore, since our method requires no training data or annotations, it can be used as-is on any comic series.

[184]  arXiv:2404.14006 (cross-list from cs.LG) [pdf, other]
Title: Distilled Datamodel with Reverse Gradient Matching
Comments: Accepted by CVPR2024
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

The proliferation of large-scale AI models trained on extensive datasets has revolutionized machine learning. With these models taking on increasingly central roles in various applications, the need to understand their behavior and enhance interpretability has become paramount. To investigate the impact of changes in training data on a pre-trained model, a common approach is leave-one-out retraining. This entails systematically altering the training dataset by removing specific samples to observe resulting changes within the model. However, retraining the model for each altered dataset presents a significant computational challenge, given the need to perform this operation for every dataset variation. In this paper, we introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages. During the offline training phase, we approximate the influence of training data on the target model through a distilled synset, formulated as a reversed gradient matching problem. For online evaluation, we expedite the leave-one-out process using the synset, which is then utilized to compute the attribution matrix based on the evaluation objective. Experimental evaluations, including training data attribution and assessments of data quality, demonstrate that our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.

[185]  arXiv:2404.14016 (cross-list from cs.LG) [pdf, other]
Title: Ungeneralizable Examples
Comments: Accepted by CVPR2024
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

The training of contemporary deep learning models heavily relies on publicly available data, posing a risk of unauthorized access to online data and raising concerns about data privacy. Current approaches to creating unlearnable data involve incorporating small, specially designed noises, but these methods strictly limit data usability, overlooking its potential usage in authorized scenarios. In this paper, we extend the concept of unlearnable data to conditional data learnability and introduce \textbf{U}n\textbf{G}eneralizable \textbf{E}xamples (UGEs). UGEs exhibit learnability for authorized users while maintaining unlearnability for potential hackers. The protector defines the authorized network and optimizes UGEs to match the gradients of the original data and its ungeneralizable version, ensuring learnability. To prevent unauthorized learning, UGEs are trained by maximizing a designated distance loss in a common feature space. Additionally, to further safeguard the authorized side from potential attacks, we introduce additional undistillation optimization. Experimental results on multiple datasets and various networks demonstrate that the proposed UGEs framework preserves data usability while reducing training performance on hacker networks, even under different types of attacks.

[186]  arXiv:2404.14064 (cross-list from cs.LG) [pdf, other]
Title: Multi-view Disentanglement for Reinforcement Learning with Multiple Cameras
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

The performance of image-based Reinforcement Learning (RL) agents can vary depending on the position of the camera used to capture the images. Training on multiple cameras simultaneously, including a first-person egocentric camera, can leverage information from different camera perspectives to improve the performance of RL. However, hardware constraints may limit the availability of multiple cameras in real-world deployment. Additionally, cameras may become damaged in the real-world preventing access to all cameras that were used during training. To overcome these hardware constraints, we propose Multi-View Disentanglement (MVD), which uses multiple cameras to learn a policy that achieves zero-shot generalisation to any single camera from the training set. Our approach is a self-supervised auxiliary task for RL that learns a disentangled representation from multiple cameras, with a shared representation that is aligned across all cameras to allow generalisation to a single camera, and a private representation that is camera-specific. We show experimentally that an RL agent trained on a single third-person camera is unable to learn an optimal policy in many control tasks; but, our approach, benefiting from multiple cameras during training, is able to solve the task using only the same single third-person camera.

[187]  arXiv:2404.14076 (cross-list from cs.LG) [pdf, other]
Title: Noise contrastive estimation with soft targets for conditional models
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Soft targets combined with the cross-entropy loss have shown to improve generalization performance of deep neural networks on supervised classification tasks. The standard cross-entropy loss however assumes data to be categorically distributed, which may often not be the case in practice. In contrast, InfoNCE does not rely on such an explicit assumption but instead implicitly estimates the true conditional through negative sampling. Unfortunately, it cannot be combined with soft targets in its standard formulation, hindering its use in combination with sophisticated training strategies. In this paper, we address this limitation by proposing a principled loss function that is compatible with probabilistic targets. Our new soft target InfoNCE loss is conceptually simple, efficient to compute, and can be derived within the framework of noise contrastive estimation. Using a toy example, we demonstrate shortcomings of the categorical distribution assumption of cross-entropy, and discuss implications of sampling from soft distributions. We observe that soft target InfoNCE performs on par with strong soft target cross-entropy baselines and outperforms hard target NLL and InfoNCE losses on popular benchmarks, including ImageNet. Finally, we provide a simple implementation of our loss, geared towards supervised classification and fully compatible with deep classification model trained with cross-entropy.

[188]  arXiv:2404.14077 (cross-list from cs.RO) [pdf, ps, other]
Title: Research on Robot Path Planning Based on Reinforcement Learning
Authors: Wang Ruiqi
Comments: My undergrad final year project report, 44 pages and 15 figures
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

This project has conducted research on robot path planning based on Visual SLAM. The main work of this project is as follows: (1) Construction of Visual SLAM system. Research has been conducted on the basic architecture of Visual SLAM. A Visual SLAM system is developed based on ORB-SLAM3 system, which can conduct dense point cloud mapping. (2) The map suitable for two-dimensional path planning is obtained through map conversion. This part converts the dense point cloud map obtained by Visual SLAM system into an octomap and then performs projection transformation to the grid map. The map conversion converts the dense point cloud map containing a large amount of redundant map information into an extremely lightweight grid map suitable for path planning. (3) Research on path planning algorithm based on reinforcement learning. This project has conducted experimental comparisons between the Q-learning algorithm, the DQN algorithm, and the SARSA algorithm, and found that DQN is the algorithm with the fastest convergence and best performance in high-dimensional complex environments. This project has conducted experimental verification of the Visual SLAM system in a simulation environment. The experimental results obtained based on open-source dataset and self-made dataset prove the feasibility and effectiveness of the designed Visual SLAM system. At the same time, this project has also conducted comparative experiments on the three reinforcement learning algorithms under the same experimental condition to obtain the optimal algorithm under the experimental condition.

[189]  arXiv:2404.14117 (cross-list from cs.RO) [pdf, other]
Title: Hierarchical localization with panoramic views and triplet loss functions
Comments: This work has been submitted to the Artificial Intelligence Journal (Ed. Elsevier) for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

The main objective of this paper is to address the mobile robot localization problem with Triplet Convolutional Neural Networks and test their robustness against changes of the lighting conditions. We have used omnidirectional images from real indoor environments captured in dynamic conditions that have been converted to panoramic format. Two approaches are proposed to address localization by means of triplet neural networks. First, hierarchical localization, which consists in estimating the robot position in two stages: a coarse localization, which involves a room retrieval task, and a fine localization is addressed by means of image retrieval in the previously selected room. Second, global localization, which consists in estimating the position of the robot inside the entire map in a unique step. Besides, an exhaustive study of the loss function influence on the network learning process has been made. The experimental section proves that triplet neural networks are an efficient and robust tool to address the localization of mobile robots in indoor environments, considering real operation conditions.

[190]  arXiv:2404.14281 (cross-list from cs.RO) [pdf, other]
Title: Fast and Robust Normal Estimation for Sparse LiDAR Scans
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Light Detection and Ranging (LiDAR) technology has proven to be an important part of many robotics systems. Surface normals estimated from LiDAR data are commonly used for a variety of tasks in such systems. As most of the today's mechanical LiDAR sensors produce sparse data, estimating normals from a single scan in a robust manner poses difficulties.
In this paper, we address the problem of estimating normals for sparse LiDAR data avoiding the typical issues of smoothing out the normals in high curvature areas.
Mechanical LiDARs rotate a set of rigidly mounted lasers. One firing of such a set of lasers produces an array of points where each point's neighbor is known due to the known firing pattern of the scanner. We use this knowledge to connect these points to their neighbors and label them using the angles of the lines connecting them. When estimating normals at these points, we only consider points with the same label as neighbors. This allows us to avoid estimating normals in high curvature areas.
We evaluate our approach on various data, both self-recorded and publicly available, acquired using various sparse LiDAR sensors. We show that using our method for normal estimation leads to normals that are more robust in areas with high curvature which leads to maps of higher quality. We also show that our method only incurs a constant factor runtime overhead with respect to a lightweight baseline normal estimation procedure and is therefore suited for operation in computationally demanding environments.

[191]  arXiv:2404.14322 (cross-list from eess.IV) [pdf, ps, other]
Title: A Novel Approach to Chest X-ray Lung Segmentation Using U-net and Modified Convolutional Block Attention Module
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Lung segmentation in chest X-ray images is of paramount importance as it plays a crucial role in the diagnosis and treatment of various lung diseases. This paper presents a novel approach for lung segmentation in chest X-ray images by integrating U-net with attention mechanisms. The proposed method enhances the U-net architecture by incorporating a Convolutional Block Attention Module (CBAM), which unifies three distinct attention mechanisms: channel attention, spatial attention, and pixel attention. The channel attention mechanism enables the model to concentrate on the most informative features across various channels. The spatial attention mechanism enhances the model's precision in localization by focusing on significant spatial locations. Lastly, the pixel attention mechanism empowers the model to focus on individual pixels, further refining the model's focus and thereby improving the accuracy of segmentation. The adoption of the proposed CBAM in conjunction with the U-net architecture marks a significant advancement in the field of medical imaging, with potential implications for improving diagnostic precision and patient outcomes. The efficacy of this method is validated against contemporary state-of-the-art techniques, showcasing its superiority in segmentation performance.

[192]  arXiv:2404.14326 (cross-list from cs.LG) [pdf, ps, other]
Title: Machine Learning Techniques for MRI Data Processing at Expanding Scale
Authors: Taro Langner
Comments: Book chapter pre-print
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Imaging sites around the world generate growing amounts of medical scan data with ever more versatile and affordable technology. Large-scale studies acquire MRI for tens of thousands of participants, together with metadata ranging from lifestyle questionnaires to biochemical assays, genetic analyses and more. These large datasets encode substantial information about human health and hold considerable potential for machine learning training and analysis. This chapter examines ongoing large-scale studies and the challenge of distribution shifts between them. Transfer learning for overcoming such shifts is discussed, together with federated learning for safe access to distributed training data securely held at multiple institutions. Finally, representation learning is reviewed as a methodology for encoding embeddings that express abstract relationships in multi-modal input formats.

[193]  arXiv:2404.14388 (cross-list from cs.LG) [pdf, other]
Title: STROOBnet Optimization via GPU-Accelerated Proximal Recurrence Strategies
Comments: 10 pages, 17 figures, 2023 IEEE International Conference on Big Data (BigData)
Journal-ref: 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 2023, pp. 2920-2929
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

Spatiotemporal networks' observational capabilities are crucial for accurate data gathering and informed decisions across multiple sectors. This study focuses on the Spatiotemporal Ranged Observer-Observable Bipartite Network (STROOBnet), linking observational nodes (e.g., surveillance cameras) to events within defined geographical regions, enabling efficient monitoring. Using data from Real-Time Crime Camera (RTCC) systems and Calls for Service (CFS) in New Orleans, where RTCC combats rising crime amidst reduced police presence, we address the network's initial observational imbalances. Aiming for uniform observational efficacy, we propose the Proximal Recurrence approach. It outperformed traditional clustering methods like k-means and DBSCAN by offering holistic event frequency and spatial consideration, enhancing observational coverage.

[194]  arXiv:2404.14394 (cross-list from cs.AI) [pdf, other]
Title: A Multimodal Automated Interpretability Agent
Comments: 25 pages, 13 figures
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

This paper describes MAIA, a Multimodal Automated Interpretability Agent. MAIA is a system that uses neural models to automate neural model understanding tasks like feature interpretation and failure mode discovery. It equips a pre-trained vision-language model with a set of tools that support iterative experimentation on subcomponents of other models to explain their behavior. These include tools commonly used by human interpretability researchers: for synthesizing and editing inputs, computing maximally activating exemplars from real-world datasets, and summarizing and describing experimental results. Interpretability experiments proposed by MAIA compose these tools to describe and explain system behavior. We evaluate applications of MAIA to computer vision models. We first characterize MAIA's ability to describe (neuron-level) features in learned representations of images. Across several trained models and a novel dataset of synthetic vision neurons with paired ground-truth descriptions, MAIA produces descriptions comparable to those generated by expert human experimenters. We then show that MAIA can aid in two additional interpretability tasks: reducing sensitivity to spurious features, and automatically identifying inputs likely to be mis-classified.

Replacements for Tue, 23 Apr 24

[195]  arXiv:2106.14490 (replaced) [pdf, other]
Title: Making Images Real Again: A Comprehensive Survey on Deep Image Composition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[196]  arXiv:2206.04406 (replaced) [pdf, other]
Title: Unsupervised Learning of the Total Variation Flow
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[197]  arXiv:2207.11720 (replaced) [pdf, other]
Title: Progressive Feature Learning for Realistic Cloth-Changing Gait Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[198]  arXiv:2209.05321 (replaced) [pdf, other]
Title: Deep Feature Statistics Mapping for Generalized Screen Content Image Quality Assessment
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[199]  arXiv:2212.03095 (replaced) [pdf, other]
Title: Interpretation of Neural Networks is Susceptible to Universal Adversarial Perturbations
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
[200]  arXiv:2212.11152 (replaced) [pdf, other]
Title: OpenPack: A Large-scale Dataset for Recognizing Packaging Works in IoT-enabled Logistic Environments
Journal-ref: Proceedings of IEEE International Conference on Pervasive Computing and Communications (PerCom), March 2024, pp. 90-97
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[201]  arXiv:2303.10772 (replaced) [pdf, other]
Title: Unsupervised Gait Recognition with Selective Fusion
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[202]  arXiv:2303.13959 (replaced) [pdf, other]
Title: Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion
Comments: IJCAI2024 (this https URL)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[203]  arXiv:2304.05060 (replaced) [pdf, other]
Title: SPIRiT-Diffusion: Self-Consistency Driven Diffusion Model for Accelerated MRI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[204]  arXiv:2306.00816 (replaced) [pdf, other]
Title: Versatile Backdoor Attack with Visible, Semantic, Sample-Specific, and Compatible Triggers
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
[205]  arXiv:2306.17469 (replaced) [pdf, other]
Title: Manga109Dialog: A Large-scale Dialogue Dataset for Comics Speaker Detection
Comments: Accepted to ICME2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[206]  arXiv:2307.06737 (replaced) [pdf, other]
Title: Improving 2D Human Pose Estimation in Rare Camera Views with Synthetic Data
Comments: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[207]  arXiv:2307.14593 (replaced) [pdf, other]
Title: FakeTracer: Catching Face-swap DeepFakes via Implanting Traces in Training
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[208]  arXiv:2308.13651 (replaced) [pdf, other]
Title: PCNN: Probable-Class Nearest-Neighbor Explanations Improve Fine-Grained Image Classification Accuracy for AIs and Humans
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
[209]  arXiv:2309.04891 (replaced) [pdf, other]
Title: How to Evaluate Semantic Communications for Images with ViTScore Metric?
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
[210]  arXiv:2309.08513 (replaced) [pdf, other]
Title: SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels
Comments: This work has been accepted by IJCV2023
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[211]  arXiv:2310.05108 (replaced) [pdf, other]
Title: Enhancing Representations through Heterogeneous Self-Supervised Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[212]  arXiv:2310.06214 (replaced) [pdf, other]
Title: CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding
Comments: ICLR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[213]  arXiv:2310.06629 (replaced) [pdf, other]
Title: EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[214]  arXiv:2310.10352 (replaced) [pdf, other]
Title: Semi-Supervised Crowd Counting with Contextual Modeling: Facilitating Holistic Understanding of Crowd Scenes
Comments: Accepted by TCSVT
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[215]  arXiv:2310.18737 (replaced) [pdf, other]
Title: Pre-training with Random Orthogonal Projection Image Modeling
Comments: Published as a conference paper at the International Conference on Learning Representations (ICLR) 2024. 19 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[216]  arXiv:2310.19540 (replaced) [pdf, other]
Title: IterInv: Iterative Inversion for Pixel-Level T2I Models
Comments: Accepted paper at ICME 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
[217]  arXiv:2311.17132 (replaced) [pdf, other]
Title: TransNeXt: Robust Foveal Visual Perception for Vision Transformers
Authors: Dai Shi
Comments: CVPR 2024 Camera-ready Version. Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[218]  arXiv:2311.17241 (replaced) [pdf, other]
Title: End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
Comments: Accepted to CVPR 2024. Camera-Ready Version
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[219]  arXiv:2311.17948 (replaced) [pdf, other]
Title: Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[220]  arXiv:2312.01431 (replaced) [pdf, other]
Title: D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[221]  arXiv:2312.02567 (replaced) [pdf, other]
Title: Think Twice Before Selection: Federated Evidential Active Learning for Medical Image Analysis with Domain Shifts
Comments: Accepted by CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[222]  arXiv:2312.02914 (replaced) [pdf, other]
Title: Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training
Comments: Accepted at CVPR 2024. 13 pages, 4 figures. Approved for public release: distribution unlimited
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[223]  arXiv:2312.05664 (replaced) [pdf, other]
Title: CoGS: Controllable Gaussian Splatting
Comments: CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[224]  arXiv:2312.07509 (replaced) [pdf, other]
Title: PEEKABOO: Interactive Video Generation via Masked-Diffusion
Comments: Project webpage - this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[225]  arXiv:2312.09069 (replaced) [pdf, other]
Title: PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion
Comments: Accepted by CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[226]  arXiv:2312.13328 (replaced) [pdf, other]
Title: NeLF-Pro: Neural Light Field Probes for Multi-Scale Novel View Synthesis
Comments: CVPR 2024 Conference Paper, Camera Ready Version
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[227]  arXiv:2312.14494 (replaced) [pdf, other]
Title: Revisiting Few-Shot Object Detection with Vision-Language Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[228]  arXiv:2401.03907 (replaced) [pdf, other]
Title: RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[229]  arXiv:2401.04727 (replaced) [pdf, other]
Title: Revisiting Adversarial Training at Scale
Comments: Accepted by CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[230]  arXiv:2401.09673 (replaced) [pdf, other]
Title: Artwork Protection Against Neural Style Transfer Using Locally Adaptive Adversarial Color Attack
Comments: 9 pages, 5 figures, 4 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[231]  arXiv:2401.11395 (replaced) [pdf, other]
Title: UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation
Comments: Accepted by IJCAI 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[232]  arXiv:2401.13516 (replaced) [pdf, other]
Title: Delocate: Detection and Localization for Deepfake Videos with Randomly-Located Tampered Traces
Comments: arXiv admin note: substantial text overlap with arXiv:2308.09921, arXiv:2305.05943
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
[233]  arXiv:2401.15902 (replaced) [pdf, other]
Title: A Concise but High-performing Network for Image Guided Depth Completion in Autonomous Driving
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[234]  arXiv:2402.01123 (replaced) [pdf, other]
Title: A Single Simple Patch is All You Need for AI-generated Image Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[235]  arXiv:2402.11631 (replaced) [pdf, other]
Title: Neuromorphic Face Analysis: a Survey
Comments: Submitted to Patter Recognition Letters
Subjects: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
[236]  arXiv:2402.11677 (replaced) [pdf, other]
Title: MultiCorrupt: A Multi-Modal Robustness Dataset and Benchmark of LiDAR-Camera Fusion for 3D Object Detection
Comments: Code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[237]  arXiv:2402.18673 (replaced) [pdf, other]
Title: Trends, Applications, and Challenges in Human Attention Modelling
Comments: Accepted at IJCAI 2024 Survey Track
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[238]  arXiv:2403.01497 (replaced) [pdf, other]
Title: Learning A Physical-aware Diffusion Model Based on Transformer for Underwater Image Enhancement
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[239]  arXiv:2403.01644 (replaced) [pdf, other]
Title: OccFusion: A Straightforward and Effective Multi-Sensor Fusion Framework for 3D Occupancy Prediction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[240]  arXiv:2403.01693 (replaced) [pdf, other]
Title: HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances
Comments: Revisions: 1. Added a link to project page in the abstract, 2. Updated references and related work, 3. Fixed some grammatical errors
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[241]  arXiv:2403.02611 (replaced) [pdf, other]
Title: A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive Learning
Comments: Accepted by CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[242]  arXiv:2403.04661 (replaced) [pdf, other]
Title: Dynamic Cross Attention for Audio-Visual Person Verification
Comments: Accepted to FG2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[243]  arXiv:2403.05146 (replaced) [pdf, other]
Title: Motion-Guided Dual-Camera Tracker for Low-Cost Skill Evaluation of Gastric Endoscopy
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[244]  arXiv:2403.05817 (replaced) [pdf, other]
Title: SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection
Comments: Accepted by CVPR 2024 (Oral)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[245]  arXiv:2403.06098 (replaced) [pdf, other]
Title: VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
Comments: The project (including the collected dataset VidProM and related code) is publicly available at this https URL under the CC-BY-NC 4.0 License
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
[246]  arXiv:2403.06973 (replaced) [pdf, other]
Title: Bayesian Diffusion Models for 3D Shape Reconstruction
Comments: Accepted to CVPR 2024; Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[247]  arXiv:2403.10488 (replaced) [pdf, other]
Title: Joint Multimodal Transformer for Emotion Recognition in the Wild
Comments: 10 pages, 4 figures, 6 tables, CVPRw 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[248]  arXiv:2403.10518 (replaced) [pdf, other]
Title: Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives
Comments: Accepted by CVPR2024, Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[249]  arXiv:2403.11120 (replaced) [pdf, other]
Title: Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence
Comments: Accepted by ICLR'24
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[250]  arXiv:2403.15977 (replaced) [pdf, other]
Title: Towards Two-Stream Foveation-based Active Vision Learning
Comments: Accepted version of the article, 18 pages, 14 figures
Journal-ref: IEEE Transactions on Cognitive and Developmental Systems, 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[251]  arXiv:2403.17881 (replaced) [pdf, other]
Title: Deepfake Generation and Detection: A Benchmark and Survey
Comments: We closely follow the latest developments in this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[252]  arXiv:2403.20260 (replaced) [pdf, other]
Title: Prototype-based Interpretable Breast Cancer Prediction Models: Analysis and Challenges
Comments: Accepted at World Conference on Explainable Artificial Intelligence; 21 pages, 5 figures, 3 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[253]  arXiv:2404.00257 (replaced) [src]
Title: YOLOOC: YOLO-based Open-Class Incremental Object Detection with Novel Class Discovery
Comments: Withdrawn because it was submitted without consent of the first author. In addition, this submission has some errors
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[254]  arXiv:2404.02148 (replaced) [pdf, other]
Title: Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Orthogonal Diffusion Models
Comments: Technical Report
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[255]  arXiv:2404.02899 (replaced) [pdf, other]
Title: MatAtlas: Text-driven Consistent Geometry Texturing and Material Assignment
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
[256]  arXiv:2404.05238 (replaced) [pdf, other]
Title: Allowing humans to interactively guide machines where to look does not always improve human-AI team's classification accuracy
Comments: Accepted for presentation at the XAI4CV Workshop, part of the CVPR 2024 proceedings
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
[257]  arXiv:2404.07600 (replaced) [pdf, other]
Title: Implicit and Explicit Language Guidance for Diffusion-based Visual Perception
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[258]  arXiv:2404.07762 (replaced) [pdf, other]
Title: NeuroNCAP: Photorealistic Closed-loop Safety Testing for Autonomous Driving
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[259]  arXiv:2404.08544 (replaced) [pdf, other]
Title: Analyzing Decades-Long Environmental Changes in Namibia Using Archival Aerial Photography and Deep Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[260]  arXiv:2404.08965 (replaced) [pdf, other]
Title: Seeing Text in the Dark: Algorithm and Benchmark
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[261]  arXiv:2404.09105 (replaced) [pdf, other]
Title: EGGS: Edge Guided Gaussian Splatting for Radiance Fields
Authors: Yuanhao Gong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Image and Video Processing (eess.IV)
[262]  arXiv:2404.09359 (replaced) [pdf, other]
Title: Exploring Feedback Generation in Automated Skeletal Movement Assessment: A Comprehensive Overview
Authors: Tal Hakim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[263]  arXiv:2404.09498 (replaced) [pdf, other]
Title: FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[264]  arXiv:2404.09640 (replaced) [pdf, other]
Title: CREST: Cross-modal Resonance through Evidential Deep Learning for Enhanced Zero-Shot Learning
Comments: Ongoing work; 10 pages, 2 Tables, 9 Figures; Repo is available at: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[265]  arXiv:2404.10108 (replaced) [pdf, other]
Title: GeoAI Reproducibility and Replicability: a computational and spatial perspective
Comments: Accepted by Annals of the American Association of Geographers
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[266]  arXiv:2404.10163 (replaced) [pdf, other]
Title: EyeFormer: Predicting Personalized Scanpaths with Transformer-Guided Reinforcement Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
[267]  arXiv:2404.11156 (replaced) [pdf, ps, other]
Title: Learning SO(3)-Invariant Semantic Correspondence via Local Shape Transform
Comments: Accepted to CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[268]  arXiv:2404.11202 (replaced) [pdf, other]
Title: GhostNetV3: Exploring the Training Strategies for Compact Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[269]  arXiv:2404.12216 (replaced) [pdf, other]
Title: ProTA: Probabilistic Token Aggregation for Text-Video Retrieval
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[270]  arXiv:2404.12322 (replaced) [pdf, other]
Title: Generalizable Face Landmarking Guided by Conditional Face Warping
Comments: Accepted in CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[271]  arXiv:2404.12379 (replaced) [pdf, other]
Title: Dynamic Gaussians Mesh: Consistent Mesh Reconstruction from Monocular Videos
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[272]  arXiv:2404.12547 (replaced) [pdf, other]
Title: Does Gaussian Splatting need SFM Initialization?
Comments: 14 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[273]  arXiv:2404.12734 (replaced) [pdf, other]
Title: DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer
Authors: Da Chang, Yu Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[274]  arXiv:2205.06265 (replaced) [pdf, other]
Title: ELODI: Ensemble Logit Difference Inhibition for Positive-Congruent Training
Comments: Accepted as a Regular Paper in TPAMI. Code is at this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[275]  arXiv:2207.04934 (replaced) [pdf, other]
Title: Multilevel Geometric Optimization for Regularised Constrained Linear Inverse Problems
Comments: 25 pages, 6 figures
Journal-ref: Pure and Applied Functional Analysis, Vol. 8 (2023), No. 3, pp. 855-880
Subjects: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Differential Geometry (math.DG)
[276]  arXiv:2305.12621 (replaced) [pdf, other]
Title: DermSynth3D: Synthesis of in-the-wild Annotated Dermatology Images
Comments: Accepted to Medical Image Analysis (MedIA) 2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[277]  arXiv:2306.05175 (replaced) [pdf, other]
Title: Large-scale Dataset Pruning with Dynamic Uncertainty
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[278]  arXiv:2306.16927 (replaced) [pdf, other]
Title: End-to-end Autonomous Driving: Challenges and Frontiers
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[279]  arXiv:2308.04956 (replaced) [pdf, other]
Title: Improved cryo-EM Pose Estimation and 3D Classification through Latent-Space Disentanglement
Comments: 21 pages
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
[280]  arXiv:2309.03641 (replaced) [pdf, other]
Title: Spiking Structured State Space Model for Monaural Speech Enhancement
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[281]  arXiv:2310.09457 (replaced) [pdf, other]
Title: UCM-Net: A Lightweight and Efficient Solution for Skin Lesion Segmentation using MLP and CNN
Comments: 17 pages, under review
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[282]  arXiv:2312.02439 (replaced) [pdf, other]
Title: Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
Comments: Technical report
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
[283]  arXiv:2312.13534 (replaced) [pdf, other]
Title: SE(3)-Equivariant and Noise-Invariant 3D Rigid Motion Tracking in Brain MRI
Comments: under review
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[284]  arXiv:2312.15320 (replaced) [pdf, ps, other]
Title: GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal Machine Learning Combining Facial Images and Clinical Texts
Comments: Significant revisions
Subjects: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Genomics (q-bio.GN)
[285]  arXiv:2401.09630 (replaced) [pdf, other]
Title: CT Liver Segmentation via PVT-based Encoding and Refined Decoding
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[286]  arXiv:2401.17484 (replaced) [pdf, other]
Title: Pixel to Elevation: Learning to Predict Elevation Maps at Long Range using Images for Autonomous Offroad Navigation
Comments: 8 pages, 6 figures, Accepted in IEEE Robotics and Automation Letters (RA-L)
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[287]  arXiv:2402.09181 (replaced) [pdf, other]
Title: OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[288]  arXiv:2402.16368 (replaced) [pdf, other]
Title: SPINEPS -- Automatic Whole Spine Segmentation of T2-weighted MR images using a Two-Phase Approach to Multi-class Semantic and Instance Segmentation
Comments: this https URL
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[289]  arXiv:2403.06687 (replaced) [pdf, other]
Title: Advancing Graph Neural Networks with HL-HGAT: A Hodge-Laplacian and Attention Mechanism Approach for Heterogeneous Graph-Structured Data
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[290]  arXiv:2403.16967 (replaced) [pdf, other]
Title: Visual Whole-Body Control for Legged Loco-Manipulation
Comments: Add more details. The first two authors contribute equally. Project page: this https URL
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[291]  arXiv:2403.18985 (replaced) [pdf, ps, other]
Title: Robustness and Visual Explanation for Black Box Image, Video, and ECG Signal Classification with Reinforcement Learning
Comments: AAAI Proceedings reference: this https URL
Journal-ref: 2024 Proceedings of the AAAI Conference on Artificial Intelligence
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
[292]  arXiv:2403.19966 (replaced) [pdf, other]
Title: Multi-task Magnetic Resonance Imaging Reconstruction using Meta-learning
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
[293]  arXiv:2404.00717 (replaced) [pdf, other]
Title: End-to-End Autonomous Driving through V2X Cooperation
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
[294]  arXiv:2404.01643 (replaced) [pdf, other]
Title: A Closer Look at Spatial-Slice Features Learning for COVID-19 Detection
Comments: Camera-ready version, accepted by DEF-AI-MIA workshop, in conjunted with CVPR2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[295]  arXiv:2404.04848 (replaced) [pdf, other]
Title: Task-Aware Encoder Control for Deep Video Compression
Comments: Accepted by CVPR 2024
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[296]  arXiv:2404.11947 (replaced) [pdf, other]
Title: VCC-INFUSE: Towards Accurate and Efficient Selection of Unlabeled Examples in Semi-supervised Learning
Comments: Accepted paper of IJCAI 2024. Shijie Fang and Qianhan Feng contributed equally to this paper. New version, some problems and typos are fixed
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[297]  arXiv:2404.12814 (replaced) [pdf, other]
Title: Generative Modelling with High-Order Langevin Dynamics
Comments: Some of the results in this paper have been published or accepted at conferences such as wacv2024, icassp2024, and icme2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[ total of 297 entries: 1-297 ]
[ showing up to 2000 entries per page: fewer | more ]

Disable MathJax (What is MathJax?)

Links to: arXiv, form interface, find, cs, recent, 2404, contact, help  (Access key information)