Computer Vision and Pattern Recognition

New submissions

New submissions for Wed, 8 Feb 23

[1]  arXiv:2302.03064 [pdf, other]
Title: Investigating Pulse-Echo Sound Speed Estimation in Breast Ultrasound with Deep Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Ultrasound is an adjunct tool to mammography that can quickly and safely aid physicians with diagnosing breast abnormalities. Clinical ultrasound often assumes a constant sound speed to form B-mode images for diagnosis. However, the various types of breast tissue, such as glandular, fat, and lesions, differ in sound speed. These differences can degrade the image reconstruction process. Alternatively, sound speed can be a powerful tool for identifying disease. To this end, we propose a deep-learning approach for sound speed estimation from in-phase and quadrature ultrasound signals. First, we develop a large-scale simulated ultrasound dataset that generates quasi-realistic breast tissue by modeling breast gland, skin, and lesions with varying echogenicity and sound speed. We then develop a fully convolutional neural network architecture, trained on the simulated dataset, that produces an estimated sound speed map from three complex-valued in-phase and quadrature ultrasound images formed from plane-wave transmissions at separate angles. Furthermore, thermal noise augmentation is used during model optimization to enhance generalizability to real ultrasound data. We evaluate the model on simulated, phantom, and in-vivo breast ultrasound data, demonstrating its ability to accurately estimate sound speeds consistent with previously reported values in the literature. Our simulated dataset and model will be publicly available to provide a step towards accurate and generalizable sound speed estimation for pulse-echo ultrasound imaging.

[2]  arXiv:2302.03084 [pdf, other]
Title: Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In Composed Image Retrieval (CIR), a user combines a query image with text to describe their intended target. Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image. Labeling such triplets is expensive and hinders broad applicability of CIR. In this work, we propose to study an important task, Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training. To this end, we propose a novel method, called Pic2Word, that requires only weakly labeled image-caption pairs and unlabeled image datasets to train. Unlike existing supervised CIR models, our model trained on weakly labeled or unlabeled datasets shows strong generalization across diverse ZS-CIR tasks, e.g., attribute editing, object composition, and domain conversion. Our approach outperforms several supervised CIR methods on the common CIR benchmarks, CIRR and Fashion-IQ. Code will be made publicly available at https://github.com/google-research/composed_image_retrieval.

[3]  arXiv:2302.03114 [pdf, other]
Title: From CAD models to soft point cloud labels: An automatic annotation pipeline for cheaply supervised 3D semantic segmentation
Comments: This work has been submitted to the IEEE for possible publication
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We propose a fully automatic annotation scheme which takes a raw 3D point cloud with a set of fitted CAD models as input, and outputs convincing point-wise labels which can be used as cheap training data for point cloud segmentation. Compared to manual annotations, we show that our automatic labels are accurate while drastically reducing the annotation time, and eliminating the need for manual intervention or dataset-specific parameters. Our labeling pipeline outputs semantic classes and soft point-wise object scores which can either be binarized into standard one-hot-encoded labels, thresholded into weak labels with ambiguous points left unlabeled, or used directly as soft labels during training. We evaluate the label quality and segmentation performance of PointNet++ on a dataset of real industrial point clouds and Scan2CAD, a public dataset of indoor scenes. Our results indicate that reducing supervision in areas which are more difficult to label automatically is beneficial, compared to the conventional approach of naively assigning a hard "best guess" label to every point.
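The three ways of consuming the soft point-wise object scores described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the 0.5 binarization cut-off, and the 0.3/0.7 ambiguity band are hypothetical values chosen for the example.

```python
import math
from typing import List, Optional, Sequence, Tuple


def binarize(scores: Sequence[float], cut: float = 0.5) -> List[int]:
    """Hard labels: every point gets 0 or 1 (cut-off of 0.5 is an assumption)."""
    return [1 if s >= cut else 0 for s in scores]


def weak_labels(scores: Sequence[float],
                low: float = 0.3,
                high: float = 0.7) -> List[Optional[int]]:
    """Weak labels: confident points get 0/1, ambiguous points stay None (unlabeled)."""
    labels: List[Optional[int]] = []
    for s in scores:
        if s >= high:
            labels.append(1)
        elif s <= low:
            labels.append(0)
        else:
            labels.append(None)  # left out of the training loss
    return labels


def soft_labels(scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Soft labels: keep the score as a (background, object) distribution."""
    return [(1.0 - s, s) for s in scores]


def soft_cross_entropy(pred: Sequence[float], target: Sequence[float]) -> float:
    """Cross-entropy against a soft target; terms with zero target weight are skipped."""
    return -sum(t * math.log(p) for p, t in zip(pred, target) if t > 0.0)


scores = [0.1, 0.5, 0.9]
print(binarize(scores))     # [0, 1, 1]
print(weak_labels(scores))  # [0, None, 1]
```

With weak labels, the ambiguous middle point simply contributes nothing to the loss, which is one way to reduce supervision in regions that are hard to label automatically.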

[4]  arXiv:2302.03120 [pdf, other]
Title: Studying Therapy Effects and Disease Outcomes in Silico using Artificial Counterfactual Tissue Samples
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Understanding the interactions of different cell types inside the immune tumor microenvironment (iTME) is crucial for the development of immunotherapy treatments as well as for predicting their outcomes. Highly multiplexed tissue imaging (HMTI) technologies offer a tool which can capture cell properties of tissue samples by measuring expression of various proteins and storing them in separate image channels. HMTI technologies can be used to gain insights into the iTME and in particular how the iTME differs for different patient outcome groups of interest (e.g., treatment responders vs. non-responders). Understanding the systematic differences in the iTME of different patient outcome groups is crucial for developing better treatments and personalising existing treatments. However, such analyses are inherently limited by the fact that any two tissue samples vary due to a large number of factors unrelated to the outcome. Here, we present CF-HistoGAN, a machine learning framework that employs generative adversarial networks (GANs) to create artificial counterfactual tissue samples that resemble the original tissue samples as closely as possible but capture the characteristics of a different patient outcome group. Specifically, we learn to "translate" HMTI samples from one patient group to create artificial paired samples. We show that this approach makes it possible to directly study the effects of different patient outcomes on the iTMEs of individual tissue samples. We demonstrate that CF-HistoGAN can be employed as an explorative tool for understanding iTME effects on the pixel level. Moreover, we show that our method can be used to identify statistically significant differences in the expression of different proteins between patient groups with greater sensitivity compared to conventional approaches.

[5]  arXiv:2302.03128 [pdf, other]
Title: Cooperverse: A Mobile-Edge-Cloud Framework for Universal Cooperative Perception with Mixed Connectivity and Automation
Comments: 6 pages, 7 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

Cooperative perception (CP) is attracting increasing attention and is regarded as the core foundation to support cooperative driving automation, a potential key solution to addressing the safety, mobility, and sustainability issues of contemporary transportation systems. However, research on CP is still in its early stages: a systematic problem formulation, which would serve as the essential guideline for designing a CP system under real-world conditions, is still missing. In this paper, we formulate a universal CP system as an optimization problem and propose a mobile-edge-cloud framework called Cooperverse. This system addresses CP in a mixed connectivity and automation environment. A Dynamic Feature Sharing (DFS) methodology is introduced to support this CP system under certain constraints and a Random Priority Filtering (RPF) method is proposed to conduct DFS with high performance. Experiments have been conducted based on a high-fidelity CP platform. The results show that the Cooperverse framework is effective for dynamic node engagement, that the proposed DFS methodology can improve system CP performance by 14.5%, and that the RPF method can reduce the communication cost for mobile nodes by 90% with only a 1.7% drop in average precision.

[6]  arXiv:2302.03156 [pdf, other]
Title: Novel Building Detection and Location Intelligence Collection in Aerial Satellite Imagery
Comments: 9 pages (5 main pages, 4 auxiliary pages)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Detecting building structures and collecting information about these buildings in aerial images is important for city planning and management and for land use analysis. It can be the centerpiece for answering important questions in areas such as planning evacuation routes in case of an earthquake, flood management, etc. These applications rely on being able to accurately retrieve up-to-date information. Being able to accurately detect buildings in a bounding box centered on a specific latitude-longitude value can help greatly. The key challenge is to be able to detect buildings which can be commercial, industrial, hut settlements, or skyscrapers. Once we are able to detect such buildings, our goal will be to cluster and categorize similar types of buildings together.

[7]  arXiv:2302.03198 [pdf, other]
Title: Scaling Self-Supervised End-to-End Driving with Multi-View Attention Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In end-to-end driving, a large amount of expert driving demonstrations is used to train an agent that mimics the expert by predicting its control actions. This process is self-supervised on vehicle signals (e.g., steering angle, acceleration) and does not require extra costly supervision (human labeling). Yet, work on improving self-supervised end-to-end driving models has largely given way to modular end-to-end models, which require labeling-intensive data formats such as semantic segmentation at training time. However, we argue that the latest self-supervised end-to-end models were developed in sub-optimal conditions with low-resolution images and no attention mechanisms. Further, those models are confined to a limited field of view and are far from human visual cognition, which can quickly attend to far-apart scene features, a trait that provides a useful inductive bias. In this context, we present a new end-to-end model, trained by self-supervised imitation learning, leveraging a large field of view and a self-attention mechanism. These settings contribute more to the agent's understanding of the driving scene, which leads to a better imitation of human drivers. With only self-supervised training data, our model yields almost expert performance in CARLA's Nocrash metrics and could rival the SOTA models requiring large amounts of human labeled data. To facilitate further research, our code will be released.

[8]  arXiv:2302.03242 [pdf, other]
Title: Online Misinformation Video Detection: A Survey
Comments: 10 pages, 2 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Social and Information Networks (cs.SI)

With information consumption via online video streaming becoming increasingly popular, misinformation video poses a new threat to the health of the online information ecosystem. Though previous studies have made much progress in detecting misinformation in text and image formats, video-based misinformation brings new and unique challenges to automatic detection systems: 1) high information heterogeneity brought by various modalities, 2) blurred distinction between misleading video manipulation and ubiquitous artistic video editing, and 3) new patterns of misinformation propagation due to the dominant role of recommendation systems on online video platforms. To facilitate research on this challenging task, we conduct this survey to present advances in misinformation video detection research. We first analyze and characterize the misinformation video from three levels including signals, semantics, and intents. Based on the characterization, we systematically review existing works for detection from features of various modalities to techniques for clue integration. We also introduce existing resources including representative datasets and widely used tools. Besides summarizing existing studies, we discuss related areas and outline open issues and future directions to encourage and guide more research on misinformation video detection. Our corresponding public repository is available at https://github.com/ICTMCG/Awesome-Misinfo-Video-Detection.

[9]  arXiv:2302.03264 [pdf, other]
Title: Delving Deep into Simplicity Bias for Long-Tailed Image Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Simplicity Bias (SB) is the phenomenon whereby deep neural networks tend to favor simpler predictive patterns and ignore some complex features when applied to supervised discriminative tasks. In this work, we investigate SB in long-tailed image recognition and find the tail classes suffer more severely from SB, which harms the generalization performance of such underrepresented classes. We empirically report that self-supervised learning (SSL) can mitigate SB and complement its supervised counterpart by enriching the features extracted from tail samples and consequently taking better advantage of such rare samples. However, standard SSL methods are designed without explicitly considering the inherent data distribution in terms of classes and may not be optimal for long-tailed distributed data. To address this limitation, we propose a novel SSL method tailored to imbalanced data. It leverages SSL at three diverse levels, i.e., holistic-, partial-, and augmented-level, to enhance the learning of predictive complex patterns, which provides the potential to overcome the severe SB on tail data. Both quantitative and qualitative experimental results on five long-tailed benchmark datasets show our method can effectively mitigate SB and significantly outperform the competing state-of-the-art methods.

[10]  arXiv:2302.03282 [pdf, other]
Title: An End-to-End Two-Phase Deep Learning-Based workflow to Segment Man-made Objects Around Reservoirs
Comments: 21 pages, 13 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Reservoirs are fundamental infrastructures for the management of water resources. Constructions around them can negatively impact their quality. Such unauthorized constructions can be monitored by land cover mapping (LCM) of remote sensing (RS) images. In this paper, we develop a new approach based on deep learning (DL) and image processing techniques for man-made object segmentation around the reservoirs. In order to segment man-made objects around the reservoirs in an end-to-end procedure, segmenting reservoirs and identifying the region of interest (RoI) around them are essential. In the proposed two-phase workflow, the reservoir is initially segmented using a DL model. A post-processing stage is proposed to remove errors such as floating vegetation. Next, the RoI around the reservoir (RoIaR) is identified using the proposed image processing techniques. Finally, the man-made objects in the RoIaR are segmented using a DL architecture. We trained the proposed workflow using collected Google Earth (GE) images of eight reservoirs in Brazil over two different years. The U-Net-based and SegNet-based architectures are trained to segment the reservoirs. To segment man-made objects in the RoIaR, we trained and evaluated four possible architectures, U-Net, FPN, LinkNet, and PSPNet. Although the collected images are highly diverse (for example, they come from different states, seasons, and resolutions), we achieved good performance in both phases. Furthermore, applying the proposed post-processing to the output of reservoir segmentation improves the precision in all studied reservoirs except in two cases. We validated the prepared workflow with a reservoir dataset outside the training reservoirs. The results show high generalization ability of the prepared workflow.

[11]  arXiv:2302.03292 [pdf, other]
Title: Fine-grained Affordance Annotation for Egocentric Hand-Object Interaction Videos
Comments: WACV. arXiv admin note: substantial text overlap with arXiv:2206.05424
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Object affordance is an important concept in hand-object interaction, providing information on action possibilities based on human motor capacity and objects' physical properties, thus benefiting tasks such as action anticipation and robot imitation learning. However, definitions of affordance in existing datasets often: 1) mix up affordance with object functionality; 2) confuse affordance with goal-related action; and 3) ignore human motor capacity. This paper proposes an efficient annotation scheme to address these issues by combining goal-irrelevant motor actions and grasp types as affordance labels and introducing the concept of mechanical action to represent the action possibilities between two objects. We provide new annotations by applying this scheme to the EPIC-KITCHENS dataset and test our annotation with tasks such as affordance recognition, hand-object interaction hotspots prediction, and cross-domain evaluation of affordance. The results show that models trained with our annotation can distinguish affordance from other concepts, predict fine-grained interaction possibilities on objects, and generalize across different domains.

[12]  arXiv:2302.03298 [pdf, other]
Title: Boosting Zero-shot Classification with Synthetic Data Diversity via Stable Diffusion
Comments: (7 pages, 3 figures, 2 tables, preprint)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Recent research has shown it is possible to perform zero-shot classification tasks by training a classifier with synthetic data generated by a diffusion model. However, the performance of this approach is still inferior to that of recent vision-language models. It has been suggested that the reason for this is a domain gap between the synthetic and real data. In our work, we show that this domain gap is not the main issue, and that diversity in the synthetic dataset is more important. We propose a "bag of tricks" to improve diversity and are able to achieve performance on par with one of the vision-language models, CLIP. More importantly, this insight allows us to endow any classification model with zero-shot classification capabilities.

[13]  arXiv:2302.03318 [pdf]
Title: PAMI: partition input and aggregate outputs for model interpretation
Comments: 28 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

There is an increasing demand for interpretation of model predictions, especially in high-risk applications. Various visualization approaches have been proposed to estimate the part of the input which is relevant to a specific model prediction. However, most approaches require model structure and parameter details in order to obtain the visualization results, and in general much effort is required to adapt each approach to multiple types of tasks, particularly when the model backbone and input format change across tasks. In this study, a simple yet effective visualization framework called PAMI is proposed based on the observation that deep learning models often aggregate features from local regions for model predictions. The basic idea is to mask the majority of the input and use the corresponding model output as the relative contribution of the preserved input part to the original model prediction. For each input, since only a set of model outputs are collected and aggregated, PAMI does not require any model detail and can be applied to various prediction tasks with different model backbones and input formats. Extensive experiments on multiple tasks confirm the proposed method performs better than existing visualization approaches in more precisely finding class-specific input regions, and when applied to different model backbones and input formats. The source code will be released publicly.
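The basic idea of preserving one input part, masking the rest, and reading the model output as that part's contribution can be sketched on a one-dimensional input. This is a simplified illustration and not the released PAMI code: the function name, the fixed non-overlapping patches, the zero fill value, and the toy model are all assumptions, and the aggregation over several partitions described in the abstract is left out.

```python
from typing import Callable, List


def masked_contributions(x: List[float],
                         model: Callable[[List[float]], float],
                         patch: int,
                         fill: float = 0.0) -> List[float]:
    """Keep one patch at a time, mask everything else, and record the model output.

    The model is treated as a black box: only its outputs are used.
    """
    scores = [0.0] * len(x)
    for start in range(0, len(x), patch):
        end = min(start + patch, len(x))
        masked = [fill] * len(x)          # mask the majority of the input
        masked[start:end] = x[start:end]  # preserve only the current patch
        out = model(masked)
        for i in range(start, end):
            scores[i] = out               # output = contribution of preserved part
    return scores


def toy_model(v: List[float]) -> float:
    """Hypothetical black-box model that depends only on positions 2 and 5."""
    return 2.0 * v[2] + 0.5 * v[5]


print(masked_contributions([1.0] * 6, toy_model, patch=2))
# [0.0, 0.0, 2.0, 2.0, 0.5, 0.5]
```

Because the procedure only queries the model, the same loop applies regardless of the backbone; for images, the patches would be two-dimensional regions instead of slices.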

[14]  arXiv:2302.03397 [pdf, other]
Title: AniPixel: Towards Animatable Pixel-Aligned Human Avatar
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Neural radiance fields using pixel-aligned features can render photo-realistic novel views. However, when pixel-aligned features are directly introduced to human avatar reconstruction, the rendering can only be conducted for still humans, rather than animatable avatars. In this paper, we propose AniPixel, a novel animatable and generalizable human avatar reconstruction method that leverages pixel-aligned features for body geometry prediction and RGB color blending. Technically, to align the canonical space with the target space and the observation space, we propose a bidirectional neural skinning field based on skeleton-driven deformation to establish the target-to-canonical and canonical-to-observation correspondences. Then, we disentangle the canonical body geometry into a normalized neutral-sized body and a subject-specific residual for better generalizability. As the geometry and appearance are closely related, we introduce pixel-aligned features to facilitate the body geometry prediction and detailed surface normals to reinforce the RGB color blending. Moreover, we devise a pose-dependent and view direction-related shading module to represent the local illumination variance. Experiments show that our AniPixel renders comparable novel views while delivering better novel pose animation results than state-of-the-art methods. The code will be released.

[15]  arXiv:2302.03406 [pdf, other]
Title: High-Resolution GAN Inversion for Degraded Images in Large Diverse Datasets
Comments: Accepted by AAAI 2023
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent decades have been marked by massive and diverse image data of increasingly high resolution and quality. However, some of the images we obtain may be corrupted, affecting perception and the application of downstream tasks. A generic method for generating a high-quality image from a degraded one is in demand. In this paper, we present a novel GAN inversion framework that utilizes the powerful generative ability of StyleGAN-XL for this problem. To ease the inversion challenge with StyleGAN-XL, Clustering & Regularize Inversion (CRI) is proposed. Specifically, the latent space is first divided into finer-grained sub-spaces by clustering. Instead of initializing the inversion with the average latent vector, we approximate a centroid latent vector from the clusters, which generates an image close to the input image. Then, an offset with a regularization term is introduced to keep the inverted latent vector within a certain range. We validate our CRI scheme on multiple restoration tasks (i.e., inpainting, colorization, and super-resolution) of complex natural images, and show preferable quantitative and qualitative results. We further demonstrate our technique is robust in terms of data and different GAN models. To the best of our knowledge, we are the first to adopt StyleGAN-XL for generating high-quality natural images from diverse degraded inputs. Code is available at https://github.com/Booooooooooo/CRI.

[16]  arXiv:2302.03432 [pdf, other]
Title: SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Learning to segment images purely by relying on the image-text alignment from web data can lead to sub-optimal performance due to noise in the data. The noise comes from the samples where the associated text does not correlate with the image's visual content. Instead of purely relying on the alignment from the noisy data, this paper proposes a novel loss function termed SimCon, which accounts for intra-modal similarities to determine the appropriate set of positive samples to align. Further, using multiple views of the image (created synthetically) for training and combining the SimCon loss with it makes the training more robust. This version of the loss is termed MV-SimCon. The empirical results demonstrate that using the proposed loss function leads to consistent improvements on zero-shot, text supervised semantic segmentation and outperforms the state of the art by +3.0%, +3.3% and +6.9% on PASCAL VOC, PASCAL Context and MSCOCO, respectively. With test time augmentations, we set a new record by improving these results further to 58.7%, 26.6%, and 33.3% on PASCAL VOC, PASCAL Context, and MSCOCO, respectively. In addition, using the proposed loss function leads to robust training and faster convergence.

[17]  arXiv:2302.03442 [pdf, other]
Title: Using t-distributed stochastic neighbor embedding for visualization and segmentation of 3D point clouds of plants
Authors: Helin Dutagaci
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this work, the use of t-SNE is proposed to embed 3D point clouds of plants into 2D space for plant characterization. It is demonstrated that t-SNE operates as a practical tool to flatten and visualize a complete 3D plant model in 2D space. The perplexity parameter of t-SNE allows 2D rendering of plant structures at various organizational levels. Aside from the promise of serving as a visualization tool for plant scientists, t-SNE also provides a gateway for processing 3D point clouds of plants using their embedded counterparts in 2D. In this paper, simple methods are proposed to perform semantic segmentation and instance segmentation via grouping the embedded 2D points. The evaluation of these methods on a public 3D plant data set conveys the potential of t-SNE for enabling 2D implementation of various steps involved in automatic 3D phenotyping pipelines.

[18]  arXiv:2302.03477 [pdf, other]
Title: Explainable Action Prediction through Self-Supervision on Scene Graphs
Comments: Accepted to the 2023 IEEE International Conference on Robotics and Automation (ICRA)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

This work explores scene graphs as a distilled representation of high-level information for autonomous driving, applied to future driver-action prediction. Given the scarcity and strong imbalance of data samples, we propose a self-supervision pipeline to infer representative and well-separated embeddings. Key aspects are interpretability and explainability; as such, we embed in our architecture attention mechanisms that can create spatial and temporal heatmaps on the scene graphs. We evaluate our system on the ROAD dataset against a fully-supervised approach, showing the superiority of our training regime.

[19]  arXiv:2302.03523 [pdf, other]
Title: Sparse Mixture Once-for-all Adversarial Training for Efficient In-Situ Trade-Off Between Accuracy and Robustness of DNNs
Comments: 5 pages, 5 figures, 2 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Existing deep neural networks (DNNs) that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on either activation or weight conditioned convolution operations. However, such conditional learning costs additional multiply-accumulate (MAC) or addition operations, increasing inference memory and compute costs. To that end, we present a sparse mixture once-for-all adversarial training (SMART) that allows a model to train once and then trade off in situ between accuracy and robustness, at a reduced compute and parameter overhead. In particular, SMART develops two expert paths, for clean and adversarial images, respectively, that are then conditionally trained via respective dedicated sets of binary sparsity masks. Extensive evaluations on multiple image classification datasets across different models show SMART to have up to 2.72x fewer non-zero parameters, with a proportional reduction in compute overhead, while yielding a SOTA accuracy-robustness trade-off. Additionally, we present insightful observations in designing sparse masks to successfully condition on both clean and perturbed images.

[20]  arXiv:2302.03531 [pdf, other]
Title: Structured Generative Models for Scene Understanding
Comments: 33 pages, 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This position paper argues for the use of structured generative models (SGMs) for scene understanding. This requires the reconstruction of a 3D scene from an input image, whereby the contents of the image are causally explained in terms of models of instantiated objects, each with their own type, shape, appearance and pose, along with global variables like scene lighting and camera parameters. This approach also requires scene models which account for the co-occurrences and inter-relationships of objects in a scene. The SGM approach has the merits that it is compositional and generative, which leads to interpretability.
To pursue the SGM agenda, we need models for objects and scenes, and approaches to carry out inference. We first review models for objects, which include "things" (object categories that have a well defined shape), and "stuff" (categories which have amorphous spatial extent). We then move on to review scene models which describe the inter-relationships of objects. Perhaps the most challenging problem for SGMs is inference of the objects, lighting and camera parameters, and scene inter-relationships from input consisting of a single or multiple images. We conclude with a discussion of issues that need addressing to advance the SGM agenda.

[21]  arXiv:2302.03533 [pdf, other]
Title: Revisiting Pre-training in Audio-Visual Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Pre-training techniques have achieved tremendous success in enhancing model performance on various tasks, but have been found to perform worse than training from scratch in some uni-modal situations. This inspires us to ask: are pre-trained models always effective in the more complex multi-modal scenario, especially for heterogeneous modalities such as audio and visual ones? We find that the answer is no. Specifically, we explore the effects of pre-trained models on two audio-visual learning scenarios: cross-modal initialization and multi-modal joint learning. When cross-modal initialization is applied, the phenomenon of "dead channels" caused by abnormal Batchnorm parameters hinders the utilization of model capacity. Thus, we propose Adaptive Batchnorm Re-initialization (ABRi) to better exploit the capacity of pre-trained models for target tasks. In multi-modal joint learning, we find that a strong pre-trained uni-modal encoder can have negative effects on the encoder of another modality. To alleviate this problem, we introduce a two-stage Fusion Tuning strategy, taking better advantage of the pre-trained knowledge while making the uni-modal encoders cooperate with an adaptive masking method. The experimental results show that our methods can further exploit pre-trained models' potential and boost performance in audio-visual learning.

[22]  arXiv:2302.03548 [pdf, other]
Title: PhysFormer++: Facial Video-based Physiological Measurement with SlowFast Temporal Difference Transformer
Comments: Accepted by International Journal of Computer Vision (IJCV). arXiv admin note: substantial text overlap with arXiv:2111.12082
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Remote photoplethysmography (rPPG), which aims at measuring heart activities and physiological signals from facial video without any contact, has great potential in many applications (e.g., remote healthcare and affective computing). Recent deep learning approaches focus on mining subtle rPPG clues using convolutional neural networks with limited spatio-temporal receptive fields, which neglect the long-range spatio-temporal perception and interaction for rPPG modeling. In this paper, we propose two end-to-end video transformer based architectures, namely PhysFormer and PhysFormer++, to adaptively aggregate both local and global spatio-temporal features for rPPG representation enhancement. As key modules in PhysFormer, the temporal difference transformers first enhance the quasi-periodic rPPG features with temporal difference guided global attention, and then refine the local spatio-temporal representation against interference. To better exploit the temporal contextual and periodic rPPG clues, we also extend the PhysFormer to the two-pathway SlowFast based PhysFormer++ with temporal difference periodic and cross-attention transformers. Furthermore, we propose label distribution learning and a curriculum learning inspired dynamic constraint in the frequency domain, which provide elaborate supervision for PhysFormer and PhysFormer++ and alleviate overfitting. Comprehensive experiments are performed on four benchmark datasets to show our superior performance on both intra- and cross-dataset testing. Unlike most transformer networks, which need pretraining on large-scale datasets, the proposed PhysFormer family can be easily trained from scratch on rPPG datasets, which makes it promising as a novel transformer baseline for the rPPG community.

[23]  arXiv:2302.03566 [pdf, other]
Title: Look around and learn: self-improving object detection by exploration
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Object detectors often experience a drop in performance when new environmental conditions are insufficiently represented in the training data. This paper studies how to automatically fine-tune a pre-existing object detector while exploring and acquiring images in a new environment without relying on human intervention, i.e., in an utterly self-supervised fashion. In our setting, an agent initially learns to explore the environment using a pre-trained off-the-shelf detector to locate objects and associate pseudo-labels. By assuming that pseudo-labels for the same object must be consistent across different views, we learn an exploration policy mining hard samples and we devise a novel mechanism for producing refined predictions from the consensus among observations. Our approach outperforms the current state-of-the-art, and it closes the performance gap against a fully supervised setting without relying on ground-truth annotations. We also compare various exploration policies for the agent to gather more informative observations. Code and dataset will be made available upon paper acceptance.

[24]  arXiv:2302.03594 [pdf, other]
Title: NICER-SLAM: Neural Implicit Scene Encoding for RGB SLAM
Comments: Video: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Neural implicit representations have recently become popular in simultaneous localization and mapping (SLAM), especially in dense visual SLAM. However, previous works in this direction either rely on RGB-D sensors, or require a separate monocular SLAM approach for camera tracking and do not produce high-fidelity dense 3D scene reconstruction. In this paper, we present NICER-SLAM, a dense RGB SLAM system that simultaneously optimizes for camera poses and a hierarchical neural implicit map representation, which also allows for high-quality novel view synthesis. To facilitate the optimization process for mapping, we integrate additional supervision signals including easy-to-obtain monocular geometric cues and optical flow, and also introduce a simple warping loss to further enforce geometry consistency. Moreover, to further boost performance in complicated indoor scenes, we also propose a local adaptive transformation from signed distance functions (SDFs) to density in the volume rendering equation. On both synthetic and real-world datasets we demonstrate strong performance in dense mapping, tracking, and novel view synthesis, even competitive with recent RGB-D SLAM systems.

[25]  arXiv:2302.03629 [pdf, ps, other]
Title: Ethical Considerations for Collecting Human-Centric Image Datasets
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

Human-centric image datasets are critical to the development of computer vision technologies. However, recent investigations have foregrounded significant ethical issues related to privacy and bias, which have resulted in the complete retraction, or modification, of several prominent datasets. Recent works have tried to reverse this trend, for example, by proposing analytical frameworks for ethically evaluating datasets, the standardization of dataset documentation and curation practices, privacy preservation methodologies, as well as tools for surfacing and mitigating representational biases. Little attention, however, has been paid to the realities of operationalizing ethical data collection. To fill this gap, we present a set of key ethical considerations and practical recommendations for collecting more ethically-minded human-centric image data. Our research directly addresses issues of privacy and bias by contributing to the research community best practices for ethical data collection, covering purpose, privacy and consent, as well as diversity. We motivate each consideration by drawing on lessons from current practices, dataset withdrawals and audits, and analytical ethical frameworks. Our research is intended to augment recent scholarship, representing an important step toward more responsible data curation practices.

[26]  arXiv:2302.03640 [pdf, other]
Title: S4R: Self-Supervised Semantic Scene Reconstruction from RGB-D Scans
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Most deep learning approaches to comprehensive semantic modeling of 3D indoor spaces require costly dense annotations in the 3D domain. In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction, using a fully self-supervised approach. To this end, we design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images, fusing cross-domain features into volumetric embeddings to predict complete 3D geometry, color, and semantics. Our key technical innovation is to leverage differentiable rendering of color and semantics, using the observed RGB images and a generic semantic segmentation model as color and semantics supervision, respectively. We additionally develop a method to synthesize an augmented set of virtual training views complementing the original real captures, enabling more efficient self-supervision for semantics. In this work we propose an end-to-end trainable solution jointly addressing geometry completion, colorization, and semantic mapping from a few RGB-D images, without 3D or 2D ground-truth. Our method is the first, to our knowledge, fully self-supervised method addressing completion and semantic segmentation of real-world 3D scans. It performs comparably well with the 3D supervised baselines, surpasses baselines with 2D supervision on real datasets, and generalizes well to unseen scenes.

[27]  arXiv:2302.03648 [pdf, other]
Title: Deep Class-Incremental Learning: A Survey
Comments: Code is available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Deep models, e.g., CNNs and Vision Transformers, have made impressive achievements in many vision tasks in the closed world. However, novel classes emerge from time to time in our ever-changing world, requiring a learning system to acquire new knowledge continually. For example, a robot needs to understand new instructions, and an opinion monitoring system should analyze emerging topics every day. Class-Incremental Learning (CIL) enables the learner to incorporate the knowledge of new classes incrementally and build a universal classifier among all seen classes. Correspondingly, when directly training the model with new class instances, a fatal problem occurs -- the model tends to catastrophically forget the characteristics of former ones, and its performance drastically degrades. There have been numerous efforts to tackle catastrophic forgetting in the machine learning community. In this paper, we comprehensively survey recent advances in deep class-incremental learning and summarize these methods from three aspects, i.e., data-centric, model-centric, and algorithm-centric. We also provide a rigorous and unified evaluation of 16 methods in benchmark image classification tasks to find out the characteristics of different algorithms empirically. Furthermore, we notice that the current comparison protocol ignores the influence of memory budget in model storage, which may result in unfair comparison and biased results. Hence, we advocate fair comparison by aligning the memory budget in evaluation, as well as several memory-agnostic performance measures. The source code to reproduce these evaluations is available at https://github.com/zhoudw-zdw/CIL_Survey/

[28]  arXiv:2302.03657 [pdf, other]
Title: Toward Face Biometric De-identification using Adversarial Examples
Comments: Accepted at the AAAI-23 workshop on Artificial Intelligence for Cyber Security (AICS)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

The remarkable success of face recognition (FR) has endangered the privacy of internet users, particularly in social media. Recently, researchers have turned to using adversarial examples as a countermeasure. In this paper, we assess the effectiveness of using two widely known adversarial methods (BIM and ILLC) for de-identifying personal images. We discovered, unlike previous claims in the literature, that it is not easy to get a high protection success rate (suppressing identification rate) with adversarial perturbation that is imperceptible to the human visual system. Finally, we found out that the transferability of adversarial examples is highly affected by the training parameters of the network with which they are generated.

[29]  arXiv:2302.03665 [pdf, other]
Title: HumanMAC: Masked Motion Completion for Human Motion Prediction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Human motion prediction is a classical problem in computer vision and computer graphics, which has a wide range of practical applications. Previous efforts achieve great empirical performance based on an encoding-decoding fashion. The methods of this fashion work by first encoding previous motions to latent representations and then decoding the latent representations into predicted motions. However, in practice, they are still unsatisfactory due to several issues, including complicated loss constraints, cumbersome training processes, and scarce switch of different categories of motions in prediction. In this paper, to address the above issues, we jump out of the foregoing fashion and propose a novel framework from a new perspective. Specifically, our framework works in a denoising diffusion style. In the training stage, we learn a motion diffusion model that generates motions from random noise. In the inference stage, with a denoising procedure, we make motion prediction conditioning on observed motions to output more continuous and controllable predictions. The proposed framework enjoys promising algorithmic properties, which only needs one loss in optimization and is trained in an end-to-end manner. Additionally, it accomplishes the switch of different categories of motions effectively, which is significant in realistic tasks, e.g., the animation task. Comprehensive experiments on benchmarks confirm the superiority of the proposed framework. The project page is available at https://lhchen.top/Human-MAC.

[30]  arXiv:2302.03675 [pdf, other]
Title: Auditing Gender Presentation Differences in Text-to-Image Models
Comments: Preprint, 23 pages, 14 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Text-to-image models, which can generate high-quality images based on textual input, have recently enabled various content-creation tools. Despite significantly affecting a wide range of downstream applications, the distributions of these generated images are still not fully understood, especially when it comes to the potential stereotypical attributes of different genders. In this work, we propose a paradigm (Gender Presentation Differences) that utilizes fine-grained self-presentation attributes to study how gender is presented differently in text-to-image models. By probing gender indicators in the input text (e.g., "a woman" or "a man"), we quantify the frequency differences of presentation-centric attributes (e.g., "a shirt" and "a dress") through human annotation and introduce a novel metric: GEP. Furthermore, we propose an automatic method to estimate such differences. The automatic GEP metric based on our approach yields a higher correlation with human annotations than that based on existing CLIP scores, consistently across three state-of-the-art text-to-image models. Finally, we demonstrate the generalization ability of our metrics in the context of gender stereotypes related to occupations.

Cross-lists for Wed, 8 Feb 23

[31]  arXiv:2302.03033 (cross-list from eess.IV) [pdf, other]
Title: Exemplars and Counterexemplars Explanations for Image Classifiers, Targeting Skin Lesion Labeling
Comments: arXiv admin note: text overlap with arXiv:2111.11863
Journal-ref: 2021 IEEE Symposium on Computers and Communications (ISCC)
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Explainable AI consists in developing mechanisms allowing for an interaction between decision systems and humans by making the decisions of the former understandable. This is particularly important in sensitive contexts such as the medical domain. We propose a use case study, for skin lesion diagnosis, illustrating how it is possible to provide the practitioner with explanations on the decisions of a state-of-the-art deep neural network classifier trained to characterize skin lesions from examples. Our framework consists of a trained classifier onto which an explanation module operates. The latter is able to offer the practitioner exemplars and counterexemplars for the classification diagnosis, thus allowing the physician to interact with the automatic diagnosis system. The exemplars are generated via an adversarial autoencoder. We illustrate the behavior of the system on representative examples.

[32]  arXiv:2302.03130 (cross-list from cs.LG) [pdf, other]
Title: Spatial Functa: Scaling Functa to ImageNet Classification and Generation
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Neural fields, also known as implicit neural representations, have emerged as a powerful means to represent complex signals of various modalities. Based on this, Dupont et al. (2022) introduce a framework that views neural fields as data, termed *functa*, and propose to do deep learning directly on this dataset of neural fields. In this work, we show that the proposed framework faces limitations when scaling up to even moderately complex datasets such as CIFAR-10. We then propose *spatial functa*, which overcome these limitations by using spatially arranged latent representations of neural fields, thereby allowing us to scale up the approach to ImageNet-1k at 256x256 resolution. We demonstrate competitive performance to Vision Transformers (Steiner et al., 2022) on classification and Latent Diffusion (Rombach et al., 2022) on image generation respectively.

[33]  arXiv:2302.03193 (cross-list from cs.LG) [pdf, other]
Title: On the Ideal Number of Groups for Isometric Gradient Propagation
Comments: 10 pages, 2 figures
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Recently, various normalization layers have been proposed to stabilize the training of deep neural networks. Among them, group normalization is a generalization of layer normalization and instance normalization by allowing a degree of freedom in the number of groups it uses. However, to determine the optimal number of groups, trial-and-error-based hyperparameter tuning is required, and such experiments are time-consuming. In this study, we discuss a reasonable method for setting the number of groups. First, we find that the number of groups influences the gradient behavior of the group normalization layer. Based on this observation, we derive the ideal number of groups, which calibrates the gradient scale to facilitate gradient descent optimization. Our proposed number of groups is theoretically grounded, architecture-aware, and can provide a proper value in a layer-wise manner for all layers. The proposed method exhibited improved performance over existing methods in numerous neural network architectures, tasks, and datasets.
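
As a minimal sketch of the relationship the abstract mentions, the following plain-Python function normalizes one sample with a chosen number of groups. It is not the paper's method for choosing the number of groups; the function name `group_norm`, the nested-list layout (channels of flattened spatial values), and the omission of learned scale and shift parameters are all assumptions made for illustration.

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Normalize one sample x, given as C channels of flattened spatial values.

    Channels are split into num_groups contiguous groups, and each group is
    standardized with its own mean and variance (no learned scale or shift).
    """
    channels = len(x)
    if channels % num_groups != 0:
        raise ValueError("num_groups must divide the number of channels")
    per_group = channels // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * per_group:(g + 1) * per_group]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in ch] for ch in group])
    return out

# Hypothetical sample: 4 channels with 2 spatial positions each.
x = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
layer_like = group_norm(x, num_groups=1)     # one group: layer-norm-style statistics
instance_like = group_norm(x, num_groups=4)  # one group per channel: instance-norm-style
```

With `num_groups=1` the statistics are shared across all channels, and with `num_groups` equal to the channel count each channel is normalized on its own; any divisor in between is the degree of freedom the abstract refers to.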

[34]  arXiv:2302.03285 (cross-list from eess.IV) [pdf, other]
Title: Improving CT Image Segmentation Accuracy Using StyleGAN Driven Data Augmentation
Comments: 17th International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine (Fully3D Conference)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Medical Image Segmentation is a useful application for medical image analysis including detecting diseases and abnormalities in imaging modalities such as MRI, CT etc. Deep learning has proven to be promising for this task but usually has a low accuracy because of the lack of appropriate publicly available annotated or segmented medical datasets. In addition, the datasets that are available may have a different texture because of different dosage values or scanner properties than the images that need to be segmented. This paper presents a StyleGAN-driven approach for segmenting publicly available large medical datasets by using readily available extremely small annotated datasets in similar modalities. The approach involves augmenting the small segmented dataset and eliminating texture differences between the two datasets. The dataset is augmented by being passed through six different StyleGANs that are trained on six different style images taken from the large non-annotated dataset we want to segment. Specifically, style transfer is used to augment the training dataset. The annotations of the training dataset are hence combined with the textures of the non-annotated dataset to generate new anatomically sound images. The augmented dataset is then used to train a U-Net segmentation network which displays a significant improvement in the segmentation accuracy in segmenting the large non-annotated dataset.

[35]  arXiv:2302.03296 (cross-list from eess.IV) [pdf, other]
Title: Multi-organ segmentation: a progressive exploration of learning paradigms under scarce annotation
Comments: 23 pages, 4 figures, 5 tables
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Precise delineation of multiple organs or abnormal regions in the human body from medical images plays an essential role in computer-aided diagnosis, surgical simulation, image-guided interventions, and especially in radiotherapy treatment planning. Thus, it is of great significance to explore automatic segmentation approaches, among which deep learning-based approaches have evolved rapidly and witnessed remarkable progress in multi-organ segmentation. However, obtaining an appropriately sized and fine-grained annotated dataset of multiple organs is extremely hard and expensive. Such scarce annotation limits the development of high-performance multi-organ segmentation models but promotes many annotation-efficient learning paradigms. Among these, studies on transfer learning leveraging external datasets, semi-supervised learning using unannotated datasets and partially-supervised learning integrating partially-labeled datasets have become the dominant way to break such a dilemma in multi-organ segmentation. We first review the traditional fully supervised method, then present a comprehensive and systematic elaboration of the three above-mentioned learning paradigms in the context of multi-organ segmentation from both technical and methodological perspectives, and finally summarize their challenges and future trends.

[36]  arXiv:2302.03299 (cross-list from eess.IV) [pdf, other]
Title: 3D Vessel Segmentation with Limited Guidance of 2D Structure-agnostic Vessel Annotations
Comments: Submitted to IEEE TMI Journal
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Delineating 3D blood vessels is essential for clinical diagnosis and treatment, but is challenging due to complex structure variations and varied imaging conditions. Supervised deep learning has demonstrated its superior capacity in automatic 3D vessel segmentation. However, the reliance on expensive 3D manual annotations and limited capacity for annotation reuse hinder the clinical applications of supervised models. To avoid the repetitive and laborious annotating and make full use of existing vascular annotations, this paper proposes a novel 3D shape-guided local discrimination model for 3D vascular segmentation under limited guidance from public 2D vessel annotations. The primary hypothesis is that 3D vessels are composed of semantically similar voxels and exhibit tree-shaped morphology. Accordingly, the 3D region discrimination loss is firstly proposed to learn the discriminative representation measuring voxel-wise similarities and cluster semantically consistent voxels to form the candidate 3D vascular segmentation in unlabeled images; secondly, based on the similarity of the tree-shaped morphology between 2D and 3D vessels, the Crop-and-Overlap strategy is presented to generate reference masks from 2D structure-agnostic vessel annotations, which are fit for varied vascular structures, and the adversarial loss is introduced to guide the tree-shaped morphology of 3D vessels; thirdly, the temporal consistency loss is proposed to foster the training stability and keep the model updated smoothly. To further enhance the model's robustness and reliability, the orientation-invariant CNN module and Reliability-Refinement algorithm are presented. Experimental results from the public 3D cerebrovascular and 3D arterial tree datasets demonstrate that our model achieves comparable effectiveness against nine supervised models.

[37]  arXiv:2302.03453 (cross-list from eess.IV) [pdf, other]
Title: OSRT: Omnidirectional Image Super-Resolution with Distortion-aware Transformer
Comments: main paper + supplement
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Omnidirectional images (ODIs) have obtained lots of research interest for immersive experiences. Although ODIs require extremely high resolution to capture details of the entire scene, the resolutions of most ODIs are insufficient. Previous methods attempt to solve this issue by image super-resolution (SR) on equirectangular projection (ERP) images. However, they omit geometric properties of ERP in the degradation process, and their models can hardly generalize to real ERP images. In this paper, we propose Fisheye downsampling, which mimics the real-world imaging process and synthesizes more realistic low-resolution samples. Then we design a distortion-aware Transformer (OSRT) to modulate ERP distortions continuously and self-adaptively. Without a cumbersome process, OSRT outperforms previous methods by about 0.2dB on PSNR. Moreover, we propose a convenient data augmentation strategy, which synthesizes pseudo ERP images from plain images. This simple strategy can alleviate the over-fitting problem of large networks and significantly boost the performance of ODISR. Extensive experiments have demonstrated the state-of-the-art performance of our OSRT. Codes and models will be available at https://github.com/Fanghua-Yu/OSRT.
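
For readers unfamiliar with the metric behind the reported 0.2dB gain, here is a minimal sketch of the standard PSNR definition in plain Python. The function name `psnr`, the flat-list image layout, and the 8-bit peak value of 255 are illustrative assumptions; the abstract does not say which PSNR variant or evaluation code is used.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel intensities; max_value is the
    largest possible intensity (255 for 8-bit images, an assumption here).
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a 0.2 dB improvement corresponds to an MSE that is lower by a factor of 10^(-0.02), roughly 4.5%.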

[38]  arXiv:2302.03473 (cross-list from eess.IV) [pdf, other]
Title: Med-NCA: Robust and Lightweight Segmentation with Neural Cellular Automata
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Access to the proper infrastructure is critical when performing medical image segmentation with Deep Learning. This requirement makes it difficult to run state-of-the-art segmentation models in resource-constrained scenarios like primary care facilities in rural areas and during crises. The recently emerging field of Neural Cellular Automata (NCA) has shown that locally interacting one-cell models can achieve competitive results in tasks such as image generation or segmentations in low-resolution inputs. However, they are constrained by high VRAM requirements and the difficulty of reaching convergence for high-resolution images. To counteract these limitations we propose Med-NCA, an end-to-end NCA training pipeline for high-resolution image segmentation. Our method follows a two-step process. Global knowledge is first communicated between cells across the downscaled image. Following that, patch-based segmentation is performed. Our proposed Med-NCA outperforms the classic UNet by 2% and 3% Dice for hippocampus and prostate segmentation, respectively, while also being 500 times smaller. We also show that Med-NCA is by design invariant with respect to image scale, shape and translation, experiencing only slight performance degradation even with strong shifts; and is robust against MRI acquisition artefacts. Med-NCA enables high-resolution medical image segmentation even on a Raspberry Pi B+, arguably the smallest device able to run PyTorch and that can be powered by a standard power bank.
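
The Dice figures quoted above follow the usual overlap definition, twice the intersection divided by the sum of the two mask sizes. The sketch below shows that definition in plain Python; the function name `dice_score`, the flat 0/1 mask layout, and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not the paper's evaluation code.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return 2.0 * intersection / total

# Hypothetical 4-pixel masks: one shared foreground pixel out of 2 + 1.
example = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground pixels.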

[39]  arXiv:2302.03476 (cross-list from eess.IV) [pdf, other]
Title: VertXNet: An Ensemble Method for Vertebrae Segmentation and Identification of Spinal X-Ray
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Reliable vertebrae annotations are key to performing analysis of spinal X-ray images. However, obtaining annotation of vertebrae from those images is usually carried out manually due to its complexity (i.e. small structures with varying shape), making it a costly and tedious process. To accelerate this process, we propose an ensemble pipeline, VertXNet, that combines two state-of-the-art (SOTA) segmentation models (respectively U-Net and Mask R-CNN) to automatically segment and label vertebrae in X-ray spinal images. Moreover, VertXNet introduces a rule-based approach that allows vertebrae labels to be robustly inferred (by locating the 'reference' vertebrae which are easier to segment than others) for a given spinal X-ray image. We evaluated the proposed pipeline on three spinal X-ray datasets (two internal and one publicly available), and compared against vertebrae annotated by radiologists. Our experimental results have shown that the proposed pipeline outperformed two SOTA segmentation models on our test dataset (MEASURE 1) with a mean Dice of 0.90, vs. a mean Dice of 0.73 for Mask R-CNN and 0.72 for U-Net. To further evaluate the generalization ability of VertXNet, the pre-trained pipeline was directly tested on two additional datasets (PREVENT and NHANES II) and consistent performance was observed with a mean Dice of 0.89 and 0.88, respectively. Overall, VertXNet demonstrated significantly improved performance for vertebra segmentation and labeling for spinal X-ray imaging, and evaluation on both in-house clinical trial data and publicly available data further proved its generalization.

[40]  arXiv:2302.03537 (cross-list from eess.IV) [pdf, other]
Title: Aligning Multi-Sequence CMR Towards Fully Automated Myocardial Pathology Segmentation
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Myocardial pathology segmentation (MyoPS) is critical for the risk stratification and treatment planning of myocardial infarction (MI). Multi-sequence cardiac magnetic resonance (MS-CMR) images can provide valuable information. For instance, balanced steady-state free precession cine sequences present clear anatomical boundaries, while late gadolinium enhancement and T2-weighted CMR sequences visualize myocardial scar and edema of MI, respectively. Existing methods usually fuse anatomical and pathological information from different CMR sequences for MyoPS, but assume that these images have been spatially aligned. However, MS-CMR images are usually unaligned due to the respiratory motions in clinical practices, which poses additional challenges for MyoPS. This work presents an automatic MyoPS framework for unaligned MS-CMR images. Specifically, we design a combined computing model for simultaneous image registration and information fusion, which aggregates multi-sequence features into a common space to extract anatomical structures (i.e., myocardium). Consequently, we can highlight the informative regions in the common space via the extracted myocardium to improve MyoPS performance, considering the spatial relationship between myocardial pathologies and myocardium. Experiments on a private MS-CMR dataset and a public dataset from the MYOPS2020 challenge show that our framework could achieve promising performance for fully automatic MyoPS.

[41]  arXiv:2302.03570 (cross-list from eess.IV) [pdf, other]
Title: A Deep Learning-based in silico Framework for Optimization on Retinal Prosthetic Stimulation
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

We propose a neural network-based framework to optimize the perceptions simulated by the in silico retinal implant model pulse2percept. The overall pipeline consists of a trainable encoder, a pre-trained retinal implant model and a pre-trained evaluator. The encoder is a U-Net, which takes the original image and outputs the stimulus. The pre-trained retinal implant model is also a U-Net, which is trained to mimic the biomimetic perceptual model implemented in pulse2percept. The evaluator is a shallow VGG classifier, which is trained with original images. Based on 10,000 test images from the MNIST dataset, we show that the convolutional neural network-based encoder performs significantly better than the trivial downsampling approach, yielding a boost in the weighted F1-Score by 36.17% in the pre-trained classifier with 6x10 electrodes. With this fully neural network-based encoder, the quality of the downstream perceptions can be fine-tuned using gradient descent in an end-to-end fashion.

[42]  arXiv:2302.03573 (cross-list from cs.RO) [pdf, other]
Title: Local Neural Descriptor Fields: Locally Conditioned Object Representations for Manipulation
Comments: ICRA 2023, Project Page: this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

A robot operating in a household environment will see a wide range of unique and unfamiliar objects. While a system could train on many of these, it is infeasible to predict all the objects a robot will see. In this paper, we present a method to generalize object manipulation skills acquired from a limited number of demonstrations, to novel objects from unseen shape categories. Our approach, Local Neural Descriptor Fields (L-NDF), utilizes neural descriptors defined on the local geometry of the object to effectively transfer manipulation demonstrations to novel objects at test time. In doing so, we leverage the local geometry shared between objects to produce a more general manipulation framework. We illustrate the efficacy of our approach in manipulating novel objects in novel poses -- both in simulation and in the real world.

[43]  arXiv:2302.03609 (cross-list from astro-ph.EP) [pdf, other]
Title: Pole Estimation and Optical Navigation using Circle of Latitude Projections
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)

Images of both rotating celestial bodies (e.g., asteroids) and spheroidal planets with banded atmospheres (e.g., Jupiter) can contain features that are well-modeled as a circle of latitude (CoL). The projections of these CoLs appear as ellipses in images collected by cameras or telescopes onboard exploration spacecraft. This work shows how CoL projections may be used to determine the pole orientation and covariance for a spinning asteroid. In the case of a known planet modeled as an oblate spheroid, it is shown how similar CoL projections may be used for spacecraft localization. These methods are developed using the principles of projective geometry. Numerical results are provided for simulated images of asteroid Bennu (for pole orientation) and of Jupiter (for spacecraft localization).

[44]  arXiv:2302.03679 (cross-list from cs.LG) [pdf, other]
Title: How Reliable is Your Regression Model's Uncertainty Under Real-World Distribution Shifts?
Comments: Code is available at this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Many important computer vision applications are naturally formulated as regression problems. Within medical imaging, accurate regression models have the potential to automate various tasks, helping to lower costs and improve patient outcomes. Such safety-critical deployment does however require reliable estimation of model uncertainty, also under the wide variety of distribution shifts that might be encountered in practice. Motivated by this, we set out to investigate the reliability of regression uncertainty estimation methods under various real-world distribution shifts. To that end, we propose an extensive benchmark of 8 image-based regression datasets with different types of challenging distribution shifts. We then employ our benchmark to evaluate many of the most common uncertainty estimation methods, as well as two state-of-the-art uncertainty scores from the task of out-of-distribution detection. We find that while methods are well calibrated when there is no distribution shift, they all become highly overconfident on many of the benchmark datasets. This uncovers important limitations of current uncertainty estimation methods, and the proposed benchmark therefore serves as a challenge to the research community. We hope that our benchmark will spur more work on how to develop truly reliable regression uncertainty estimation methods. Code is available at https://github.com/fregu856/regression_uncertainty.
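
As an illustration of what "well calibrated" versus "overconfident" can mean for a regression model, the following is a minimal sketch and not the benchmark's code. It assumes the model outputs a Gaussian mean and standard deviation per input and measures how often the target falls inside the central prediction interval. The function name `interval_coverage` and the toy numbers are hypothetical.

```python
from statistics import NormalDist


def interval_coverage(means, stds, targets, level=0.9):
    """Fraction of targets inside the central `level` Gaussian prediction interval.

    Assumption: each prediction is a Gaussian N(mean, std^2). A calibrated
    model should give coverage close to `level`; coverage well below `level`
    indicates overconfidence.
    """
    # Half-width of the central interval in units of the standard deviation.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    hits = sum(abs(y - m) <= z * s for m, s, y in zip(means, stds, targets))
    return hits / len(targets)


# Toy predictions (hypothetical values, for illustration only).
means = [0.0, 1.0, 2.0, 3.0]
stds = [1.0, 1.0, 1.0, 1.0]
in_dist_targets = [0.5, 0.2, 2.9, 3.1]   # targets close to the predictions
shifted_targets = [3.0, -2.0, 5.5, 3.2]  # targets far from the predictions

print(interval_coverage(means, stds, in_dist_targets))
print(interval_coverage(means, stds, shifted_targets))
```

On the toy shifted targets the coverage drops far below the nominal 90% level, which is the kind of overconfidence the abstract describes.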

[45]  arXiv:2302.03689 (cross-list from cs.LG) [pdf, other]
Title: PartitionVAE -- a human-interpretable VAE
Comments: 13 pages, 18 figures
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

VAEs, or variational autoencoders, are autoencoders that explicitly learn the distribution of the input image space rather than assuming no prior information about the distribution. This allows them to place similar samples close to each other in the latent space's distribution. VAEs classically assume the latent space is normally distributed, though many distribution priors work, and they encode this assumption through a K-L divergence term in the loss function. While VAEs learn the distribution of the latent space and naturally make each dimension in the latent space as disjoint from the others as possible, they do not group together similar features -- the image space feature represented by one unit of the representation layer does not necessarily have high correlation with the feature represented by a neighboring unit of the representation layer. This makes it difficult to interpret VAEs since the representation layer is not structured in a way that is easy for humans to parse. We aim to make a more interpretable VAE by partitioning the representation layer into disjoint sets of units. Partitioning the representation layer into disjoint sets of interconnected units yields a prior under which features of the input space of this new VAE, which we call a partition VAE or PVAE, are grouped together by correlation -- for example, if our image space were the space of all ping pong game images (a somewhat complex image space we use to test our architecture), then we would hope the partitions in the representation layer each learned some large feature of the image, like the characteristics of the ping pong table or the characteristics and position of the players or the ball. We also add to the PVAE a cost-saving measure: subresolution. Because we do not have access to GPU training environments for long periods of time and Google Colab Pro costs money, we attempt to decrease the complexity of the PVAE by outputting an image with dimensions scaled down from the input image by a constant factor, thus forcing the model to output a smaller version of the image. We then increase the resolution to calculate loss and train by interpolating through neighboring pixels. We train a tuned PVAE on MNIST and Sports10 to test its effectiveness.
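
The two loss ingredients mentioned in this abstract can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the K-L term uses the standard closed form for a diagonal Gaussian posterior against a standard normal prior, and the subresolution step uses nearest-neighbor upscaling as a simple stand-in for the interpolation the abstract mentions. All function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def upsample_nearest(img, factor):
    """Upscale a 2D list of pixels by an integer factor by repeating pixels.

    Assumption: nearest-neighbor repetition replaces the (unspecified)
    interpolation scheme used in the paper.
    """
    return [
        [p for p in row for _ in range(factor)]
        for row in img
        for _ in range(factor)
    ]


def mse(a, b):
    """Mean squared error between two equally sized 2D lists."""
    total = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


# The decoder emits a 2x2 image; the loss is computed against a 4x4 target.
small_output = [[1.0, 2.0], [3.0, 4.0]]
target = [
    [1.0, 1.0, 2.0, 2.0],
    [1.0, 1.0, 2.0, 2.0],
    [3.0, 3.0, 4.0, 4.0],
    [3.0, 3.0, 4.0, 5.0],
]
reconstruction_loss = mse(upsample_nearest(small_output, 2), target)
kl_loss = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
print(reconstruction_loss, kl_loss)
```

The decoder only has to produce the small image, which is where the cost saving comes from; the upscaling happens purely inside the loss computation.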

Replacements for Wed, 8 Feb 23

[46]  arXiv:1912.11164 (replaced) [pdf, ps, other]
Title: Unsupervised Scene Adaptation with Memory Regularization in vivo
Comments: 7 pages, 4 figures, 6 tables (accepted by IJCAI 2020)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[47]  arXiv:2106.07085 (replaced) [pdf, other]
Title: Survey: Image Mixing and Deleting for Data Augmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[48]  arXiv:2111.12924 (replaced) [pdf, other]
Title: Joint stereo 3D object detection and implicit surface reconstruction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
[49]  arXiv:2203.07720 (replaced) [pdf, other]
Title: Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[50]  arXiv:2204.03916 (replaced) [pdf, ps, other]
Title: A Survey of Supernet Optimization and its Applications: Spatial and Temporal Optimization for Neural Architecture Search
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[51]  arXiv:2205.07030 (replaced) [pdf, other]
Title: Realistic Defocus Blur for Multiplane Computer-Generated Holography
Comments: 16 pages in total, first 9 pages are for the manuscript, remaining pages are for supplementary. For more visit: this https URL For our codebase visit this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
[52]  arXiv:2205.13349 (replaced) [pdf, other]
Title: Learning What and Where: Disentangling Location and Identity Tracking Without Supervision
Comments: Accepted at ICLR 2023
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[53]  arXiv:2206.10690 (replaced) [pdf, other]
Title: Learning Continuous Rotation Canonicalization with Radial Beam Sampling
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[54]  arXiv:2207.04220 (replaced) [pdf, other]
Title: Rethinking Persistent Homology for Visual Recognition
Comments: ICML 2022 Workshop on Topology, Algebra, and Geometry in Machine Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[55]  arXiv:2207.05515 (replaced) [pdf, other]
Title: Compound Prototype Matching for Few-shot Action Recognition
Comments: ECCV 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[56]  arXiv:2207.07253 (replaced) [pdf, other]
Title: Single Shot Self-Reliant Scene Text Spotter by Decoupled yet Collaborative Detection and Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[57]  arXiv:2209.06994 (replaced) [pdf, ps, other]
Title: PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer
Comments: Accepted by ICRA 2023
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[58]  arXiv:2209.07126 (replaced) [pdf, other]
Title: Forgetting to Remember: A Scalable Incremental Learning Framework for Cross-Task Blind Image Quality Assessment
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[59]  arXiv:2209.15517 (replaced) [pdf, other]
Title: Medical Image Understanding with Pretrained Vision Language Models: A Comprehensive Study
Comments: Accepted to ICLR2023
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[60]  arXiv:2210.07983 (replaced) [pdf, other]
Title: Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label Movie Trailer Genre Classification
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[61]  arXiv:2210.11582 (replaced) [pdf, other]
Title: Deep Learning for Diagonal Earlobe Crease Detection
Comments: Accepted at 12th International Conference on Pattern Recognition Applications (ICPRAM 2023)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[62]  arXiv:2211.03919 (replaced) [pdf, other]
Title: ShaSTA: Modeling Shape and Spatio-Temporal Affinities for 3D Multi-Object Tracking
Comments: 10 pages, 3 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[63]  arXiv:2211.12109 (replaced) [pdf, other]
Title: Video compression dataset and benchmark of learning-based video-quality metrics
Comments: 10 pages, 4 figures, 6 tables, 1 supplementary material
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[64]  arXiv:2211.13726 (replaced) [pdf, other]
Title: Lightweight Event-based Optical Flow Estimation via Iterative Deblurring
Comments: Added supplementary materials
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[65]  arXiv:2212.02277 (replaced) [pdf]
Title: R2FD2: Fast and Robust Matching of Multimodal Remote Sensing Image via Repeatable Feature Detector and Rotation-invariant Feature Descriptor
Comments: 33 pages, 15 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[66]  arXiv:2212.03434 (replaced) [pdf, other]
Title: Name Your Colour For the Task: Artificially Discover Colour Naming via Colour Quantisation Transformer
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[67]  arXiv:2212.08892 (replaced) [pdf, other]
Title: Flattening-Net: Deep Regular 2D Representation for 3D Point Cloud Analysis
Comments: Accepted to TPAMI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[68]  arXiv:2212.10772 (replaced) [pdf, other]
Title: Low-Light Image and Video Enhancement: A Comprehensive Survey and Beyond
Comments: 21 pages, 9 tables, and 25 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[69]  arXiv:2301.08414 (replaced) [pdf, other]
Title: FG-Depth: Flow-Guided Unsupervised Monocular Depth Estimation
Comments: Accepted by ICRA2023
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[70]  arXiv:2302.00912 (replaced) [pdf]
Title: Advances and Challenges in Multimodal Remote Sensing Image Registration
Comments: 10 pages, 4 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[71]  arXiv:2302.01735 (replaced) [pdf, other]
Title: Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[72]  arXiv:2302.02088 (replaced) [pdf, other]
Title: AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[73]  arXiv:2302.02150 (replaced) [pdf]
Title: This Intestine Does Not Exist: Multiscale Residual Variational Autoencoder for Realistic Wireless Capsule Endoscopy Image Generation
Comments: 10 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[74]  arXiv:2302.02367 (replaced) [pdf, other]
Title: FastPillars: A Deployment-friendly Pillar-based 3D Detector
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[75]  arXiv:2302.02394 (replaced) [pdf, other]
Title: Eliminating Prior Bias for Semantic Image Editing via Dual-Cycle Diffusion
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[76]  arXiv:2302.02550 (replaced) [pdf, other]
Title: Domain Re-Modulation for Few-Shot Generative Domain Adaptation
Comments: Under Review
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[77]  arXiv:2302.02551 (replaced) [pdf, other]
Title: CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[78]  arXiv:2302.02693 (replaced) [pdf, other]
Title: PatchDCT: Patch Refinement for High Quality Instance Segmentation
Comments: 15 pages, 7 figures, 13 tables, accepted by ICLR 2023, the source code is available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[79]  arXiv:2201.09929 (replaced) [pdf, other]
Title: Euclidean and Affine Curve Reconstruction
Comments: This paper is a result of an REU project conducted at the North Carolina State University in the Summer and Fall 2020. This version, with improved quality of presentation and figures, is accepted to "Involve" this https URL
Subjects: Differential Geometry (math.DG); Computer Vision and Pattern Recognition (cs.CV)
[80]  arXiv:2205.00415 (replaced) [pdf, other]
Title: Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions
Comments: Accepted to EACL 2023
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[81]  arXiv:2206.14502 (replaced) [pdf, other]
Title: RegMixup: Mixup as a Regularizer Can Surprisingly Improve Accuracy and Out Distribution Robustness
Comments: 22 pages, 18 figures
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[82]  arXiv:2208.12625 (replaced) [pdf, other]
Title: Take One Gram of Neural Features, Get Enhanced Group Robustness
Comments: Long version (Previous version: OOD-CV Workshop @ ECCV 2022)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[83]  arXiv:2209.08803 (replaced) [pdf, other]
Title: Zero-shot Active Visual Search (ZAVIS): Intelligent Object Search for Robotic Assistants
Comments: To appear at ICRA 2023
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
[84]  arXiv:2210.01908 (replaced) [pdf, other]
Title: Supervised Metric Learning to Rank for Retrieval via Contextual Similarity Optimization
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[85]  arXiv:2212.13425 (replaced) [pdf, other]
Title: GEDI: GEnerative and DIscriminative Training for Self-Supervised Learning
Comments: Fixed typos/cleaned the experimental section
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[86]  arXiv:2301.05345 (replaced) [pdf, other]
Title: GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer
Comments: This manuscript was accepted to AAAI 2023 Main Track
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[87]  arXiv:2301.11482 (replaced) [src]
Title: Diffusion Denoising for Low-Dose-CT Model
Authors: Runyi Li
Comments: The method and experiment of this paper have some errors, and we need to revise them
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[88]  arXiv:2302.02524 (replaced) [pdf]
Title: Novel Fundus Image Preprocessing for Retcam Images to Improve Deep Learning Classification of Retinopathy of Prematurity
Comments: 10 pages, 4 figures, 7 tables. arXiv admin note: text overlap with arXiv:1904.08796 by other authors
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[89]  arXiv:2302.03003 (replaced) [pdf, other]
Title: OTRE: Where Optimal Transport Guided Unpaired Image-to-Image Translation Meets Regularization by Enhancing
Comments: Accepted as a conference paper to The 28th biennial international conference on Information Processing in Medical Imaging (IPMI 2023)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)