Computer Vision and Pattern Recognition

New submissions

New submissions for Fri, 24 Jun 22

[1]  arXiv:2206.11352 [pdf, ps, other]
Title: Doubly Reparameterized Importance Weighted Structure Learning for Scene Graph Generation
Comments: arXiv admin note: substantial text overlap with arXiv:2205.07017
Subjects: Computer Vision and Pattern Recognition (cs.CV)

As a structured prediction task, scene graph generation, given an input image, aims to explicitly model objects and their relationships by constructing a visually-grounded scene graph. In the current literature, this task is universally solved via a message passing neural network based mean field variational Bayesian methodology. The classical loose evidence lower bound is generally chosen as the variational inference objective, which can induce an oversimplified variational approximation and thus underestimate the underlying complex posterior. In this paper, we propose a novel doubly reparameterized importance weighted structure learning method, which employs a tighter importance weighted lower bound as the variational inference objective. It is computed from multiple samples drawn from a reparameterizable Gumbel-Softmax sampler, and the resulting constrained variational inference task is solved by a generic entropic mirror descent algorithm. The resulting doubly reparameterized gradient estimator reduces the variance of the corresponding derivatives, with a beneficial impact on learning. The proposed method achieves state-of-the-art performance on various popular scene graph generation benchmarks.
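
To make the contrast between the classical evidence lower bound and the importance weighted bound concrete, here is a minimal sketch in plain Python. It is not the paper's method: the log importance weights are made-up numbers, and the function names are hypothetical. It only illustrates that, on the same set of samples, the log of the average weight is never smaller than the average of the log weights (Jensen's inequality).

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def elbo_estimate(log_weights):
    """Monte Carlo ELBO estimate: the average of the log importance weights."""
    return sum(log_weights) / len(log_weights)

def iw_bound_estimate(log_weights):
    """Importance weighted bound estimate: log of the average importance weight."""
    return log_mean_exp(log_weights)

# Hypothetical values of log p(x, z_k) - log q(z_k | x) for K = 4 samples.
log_w = [-3.2, -1.1, -2.5, -0.7]
elbo = elbo_estimate(log_w)
iw = iw_bound_estimate(log_w)
print(f"ELBO estimate: {elbo:.4f}, importance weighted estimate: {iw:.4f}")
```

With a single sample the two estimates coincide; the gap can only open up when several samples are averaged inside the logarithm.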

[2]  arXiv:2206.11358 [pdf, other]
Title: Monocular Spherical Depth Estimation with Explicitly Connected Weak Layout Cues
Comments: Project page at this https URL
Journal-ref: ISPRS Journal of Photogrammetry and Remote Sensing, Volume 183, January 2022, Pages 269-285
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Spherical cameras capture scenes in a holistic manner and have been used for room layout estimation. Recently, with the availability of appropriate datasets, there has also been progress in depth estimation from a single omnidirectional image. While these two tasks are complementary, few works have been able to explore them in parallel to advance indoor geometric perception, and those that have done so either relied on synthetic data or used small-scale datasets, as few options are available that include both layout annotations and dense depth maps in real scenes. This is partly due to the necessity of manual annotations for room layouts. In this work, we move beyond this limitation and generate a 360 geometric vision (360V) dataset that includes multiple modalities, multi-view stereo data and automatically generated weak layout cues. We also explore an explicit coupling between the two tasks to integrate them into a single-shot trained model. We rely on depth-based layout reconstruction and layout-based depth attention, demonstrating increased performance across both tasks. By using single 360 cameras to scan rooms, the opportunity for facile and quick building-scale 3D scanning arises.

[3]  arXiv:2206.11404 [pdf, other]
Title: The ArtBench Dataset: Benchmarking Generative Models with Artworks
Comments: The first two authors contributed equally to this work. The code and data are available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation. It comprises 60,000 images of artwork from 10 distinctive artistic styles, with 5,000 training images and 1,000 testing images per style. ArtBench-10 has several advantages over previous artwork datasets. Firstly, it is class-balanced, while most previous artwork datasets suffer from long-tailed class distributions. Secondly, the images are of high quality with clean annotations. Thirdly, ArtBench-10 is created with standardized data collection, annotation, filtering, and preprocessing procedures. We provide three versions of the dataset with different resolutions ($32\times32$, $256\times256$, and original image size), formatted in a way that is easy to incorporate into popular machine learning frameworks. We also conduct extensive benchmarking experiments using representative image synthesis models with ArtBench-10 and present in-depth analysis. The dataset is available at https://github.com/liaopeiyuan/artbench under a Fair Use license.

[4]  arXiv:2206.11428 [pdf, other]
Title: LidarMultiNet: Unifying LiDAR Semantic Segmentation, 3D Object Detection, and Panoptic Segmentation in a Single Multi-task Network
Comments: Official 1st Place Solution for the Waymo Open Dataset Challenges 2022 - 3D Semantic Segmentation. Official leaderboard: this https URL CVPR 2022 Workshop on Autonomous Driving: this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This technical report presents the 1st place winning solution for the Waymo Open Dataset 3D semantic segmentation challenge 2022. Our network, termed LidarMultiNet, unifies the major LiDAR perception tasks such as 3D semantic segmentation, object detection, and panoptic segmentation in a single framework. At the core of LidarMultiNet is a strong 3D voxel-based encoder-decoder network with a novel Global Context Pooling (GCP) module extracting global contextual features from a LiDAR frame to complement its local features. An optional second stage is proposed to refine the first-stage segmentation or generate accurate panoptic segmentation results. Our solution achieves a mIoU of 71.13 and is the best for most of the 22 classes on the Waymo 3D semantic segmentation test set, outperforming all the other 3D semantic segmentation methods on the official leaderboard. We demonstrate for the first time that major LiDAR perception tasks can be unified in a single strong network that can be trained end-to-end.
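
The mIoU figure quoted above is a mean of per-class intersection-over-union scores. The sketch below shows one common way to compute it from a confusion matrix; the matrix values are invented for illustration, and the exact averaging protocol used by the Waymo leaderboard is not stated in the abstract, so treat the handling of absent classes as an assumption.

```python
def per_class_iou(confusion):
    """IoU per class from a square confusion matrix (rows: truth, columns: prediction)."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # truth c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, truth otherwise
        denom = tp + fp + fn
        # Assumption: a class that never appears is skipped rather than scored as 0.
        ious.append(tp / denom if denom else None)
    return ious

def mean_iou(confusion):
    """Mean of the per-class IoU values, ignoring classes that never appear."""
    valid = [v for v in per_class_iou(confusion) if v is not None]
    return sum(valid) / len(valid)

# Hypothetical 3-class confusion matrix (point counts).
confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
miou = mean_iou(confusion)
print(f"mIoU = {100 * miou:.2f}")
```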

[5]  arXiv:2206.11443 [pdf, other]
Title: Image-based Stability Quantification
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Quantitative evaluation of human stability using foot pressure/force measurement hardware and motion capture (mocap) technology is expensive, time consuming, and restricted to the laboratory (lab-based). We propose a novel image-based method to estimate three key components for stability computation: Center of Mass (CoM), Base of Support (BoS), and Center of Pressure (CoP). Furthermore, we quantitatively validate our image-based methods for computing two classic stability measures against the ones generated directly from lab-based sensory output (ground truth) using a publicly available multi-modality (mocap, foot pressure, 2-view videos), ten-subject human motion dataset. Using leave-one-subject-out cross validation, our experimental results show: 1) our CoM estimation method (CoMNet) consistently outperforms state-of-the-art inertial sensor-based CoM estimation techniques; 2) our image-based method combined with insole foot-pressure alone produces consistent and statistically significant correlation with ground truth stability measures (CoMtoCoP R=0.79 P<0.001, CoMtoBoS R=0.75 P<0.001); 3) our fully image-based stability metric estimation produces consistent, positive, and statistically significant correlation on the two stability metrics (CoMtoCoP R=0.31 P<0.001, CoMtoBoS R=0.22 P<0.001). Our study provides promising quantitative evidence for stability computations and monitoring in natural environments.

[6]  arXiv:2206.11459 [pdf, other]
Title: Explore Spatio-temporal Aggregation for Insubstantial Object Detection: Benchmark Dataset and Baseline
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We take on a rarely explored task named Insubstantial Object Detection (IOD), which aims to localize objects with the following characteristics: (1) amorphous shape with indistinct boundary; (2) similarity to surroundings; (3) absence of color. Accordingly, it is far more challenging to distinguish insubstantial objects in a single static frame, and the collaborative representation of spatial and temporal information is crucial. Thus, we construct an IOD-Video dataset comprised of 600 videos (141,017 frames) covering various distances, sizes, visibility, and scenes captured by different spectral ranges. In addition, we develop a spatio-temporal aggregation framework for IOD, in which different backbones are deployed and a spatio-temporal aggregation loss (STAloss) is elaborately designed to leverage the consistency along the time axis. Experiments conducted on the IOD-Video dataset demonstrate that spatio-temporal aggregation can significantly improve the performance of IOD. We hope our work will attract further research into this valuable yet challenging task. The code will be available at: https://github.com/CalayZhou/IOD-Video.

[7]  arXiv:2206.11462 [pdf, ps, other]
Title: ICME 2022 Few-shot LOGO detection top 9 solution
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

The ICME 2022 few-shot logo detection competition was held in May 2022. Participants are required to develop a single model to detect logos by handling tiny logo instances, similar brands, and adversarial images at the same time, with limited annotations. Our team ranked 16th and 11th in the first and second rounds of the competition, respectively, with a final rank of 9th. This technical report summarizes the major techniques we used in this competition and potential improvements.

[8]  arXiv:2206.11473 [pdf, other]
Title: Complementary datasets to COCO for object detection
Authors: Ali Borji
Subjects: Computer Vision and Pattern Recognition (cs.CV)

For nearly a decade, the COCO dataset has been the central test bed of research in object detection. According to recent benchmarks, however, performance on this dataset seems to have started to saturate. One possible reason is that it is not large enough for training deep models. To address this limitation, here we introduce two complementary datasets to COCO: i) COCO_OI, composed of images from COCO and OpenImages (from their 80 classes in common) with 1,418,978 training bounding boxes over 380,111 images, and 41,893 validation bounding boxes over 18,299 images, and ii) ObjectNet_D, containing objects in daily life situations (originally created for object recognition and known as ObjectNet; 29 categories in common with COCO). The latter can be used to test the generalization ability of object detectors. We evaluate some models on these datasets and pinpoint the source of errors. We encourage the community to utilize these datasets for training and testing object detection models. Code and data are available at https://github.com/aliborji/COCO_OI.

[9]  arXiv:2206.11474 [pdf, other]
Title: Entropy-driven Sampling and Training Scheme for Conditional Diffusion Generation
Comments: 24 pages, 8 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The Denoising Diffusion Probabilistic Model (DDPM) enables flexible conditional image generation from prior noise to real data by introducing an independent noise-aware classifier that provides conditional gradient guidance at each time step of the denoising process. However, because the classifier can easily discriminate an incompletely generated image that has only high-level structure, the gradient, which is a kind of class information guidance, tends to vanish early, causing the conditional generation process to collapse into the unconditional process. To address this problem, we propose two simple but effective approaches from two perspectives. For the sampling procedure, we introduce the entropy of the predicted distribution as a measure of the guidance vanishing level and propose an entropy-aware scaling method to adaptively recover the conditional semantic guidance. For the training stage, we propose entropy-aware optimization objectives to alleviate overconfident predictions for noisy data. On ImageNet1000 256x256, with our proposed sampling scheme and trained classifier, the pretrained conditional and unconditional DDPM models achieve 10.89% (4.59 to 4.09) and 43.5% (12 to 6.78) FID improvement, respectively.
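
Two quantities in this abstract are easy to reproduce by hand: the entropy of a classifier's predicted distribution, and the relative FID improvements. The sketch below computes both in plain Python. The logits are invented, and the function names are hypothetical; the paper's actual entropy-aware scaling rule is not given in the abstract and is not reproduced here.

```python
import math

def softmax(logits):
    """Convert logits to probabilities with the usual max-shift for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def relative_improvement(before, after):
    """Fractional reduction of a lower-is-better metric such as FID."""
    return (before - after) / before

# Hypothetical classifier outputs: an overconfident one and a maximally unsure one.
confident = entropy(softmax([12.0, 0.0, 0.0, 0.0]))
uncertain = entropy(softmax([0.0, 0.0, 0.0, 0.0]))
print(f"entropy (confident) = {confident:.5f}, entropy (uncertain) = {uncertain:.5f}")

# FID values quoted in the abstract.
cond_gain = relative_improvement(4.59, 4.09)
uncond_gain = relative_improvement(12.0, 6.78)
print(f"conditional: {100 * cond_gain:.2f}%, unconditional: {100 * uncond_gain:.1f}%")
```

A near-zero entropy means the classifier is already certain of the class, which is the situation the abstract associates with vanishing guidance.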

[10]  arXiv:2206.11476 [pdf]
Title: Dynamic Scene Deblurring Base on Continuous Cross-Layer Attention Transmission
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep convolutional neural networks (CNNs) using attention mechanisms have achieved great success in dynamic scene deblurring. In most of these networks, only the features refined by the attention maps can be passed to the next layer, and the attention maps of different layers are separated from each other, which does not make full use of the attention information from different layers in the CNN. To address this problem, we introduce a new continuous cross-layer attention transmission (CCLAT) mechanism that can exploit hierarchical attention information from all the convolutional layers. Based on the CCLAT mechanism, we use a very simple attention module to construct a novel residual dense attention fusion block (RDAFB). In RDAFB, the attention maps inferred from the outputs of the preceding RDAFB and each layer are directly connected to the subsequent ones, leading to a CCLAT mechanism. Taking RDAFB as the building block, we design an effective architecture for dynamic scene deblurring named RDAFNet. Experiments on benchmark datasets show that the proposed model outperforms state-of-the-art deblurring approaches and demonstrate the effectiveness of the CCLAT mechanism. The source code is available at: https://github.com/xjmz6/RDAFNet.

[11]  arXiv:2206.11493 [pdf, other]
Title: Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization
Comments: Accepted by CVPR 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The main challenge of Temporal Action Localization is to retrieve subtle human actions from various co-occurring ingredients, e.g., context and background, in an untrimmed video. While prior approaches have achieved substantial progress through devising advanced action detectors, they still suffer from these co-occurring ingredients, which often dominate the actual action content in videos. In this paper, we explore two orthogonal but complementary aspects of a video snippet, i.e., the action features and the co-occurrence features. Specifically, we develop a novel auxiliary task by decoupling these two types of features within a video snippet and recombining them to generate a new feature representation with more salient action information for accurate action localization. We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features, and then synthesizes a new action-dominated video representation. Extensive experimental results and ablation studies on THUMOS14 and ActivityNet v1.3 demonstrate that our new representation, combined with a simple action detector, can significantly improve the action localization performance.

[12]  arXiv:2206.11499 [pdf, other]
Title: Parallel Structure from Motion for UAV Images via Weighted Connected Dominating Set
Comments: 14 pages, 11 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Incremental Structure from Motion (ISfM) has been widely used for UAV image orientation. Its efficiency, however, decreases dramatically due to the sequential constraint. Although the divide-and-conquer strategy has been utilized for efficiency improvement, cluster merging becomes difficult or depends on carefully designed overlap structures. This paper proposes an algorithm to extract the global model for cluster merging and designs a parallel SfM solution to achieve efficient and accurate UAV image orientation. First, based on vocabulary tree retrieval, match pairs are selected to construct an undirected weighted match graph, whose edge weights are calculated by considering both the number and distribution of feature matches. Second, an algorithm, termed weighted connected dominating set (WCDS), is designed to simplify the match graph and build the global model; it incorporates the edge weight in the graph node selection and enables the successful reconstruction of the global model. Third, the match graph is simultaneously divided into compact and non-overlapped clusters. After the parallel reconstruction, cluster merging is conducted with the aid of common 3D points between the global and cluster models. Finally, using three UAV datasets that are captured by classical oblique and recent optimized views photogrammetry, the proposed solution is validated through comprehensive analysis and comparison. The experimental results demonstrate that the proposed parallel SfM can achieve a 17.4-fold efficiency improvement and comparable orientation accuracy. In absolute BA, the geo-referencing accuracy is approximately 2.0 and 3.0 times the GSD (Ground Sampling Distance) value in the horizontal and vertical directions, respectively. For parallel SfM, the proposed solution is a more reliable alternative.
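
As a rough illustration of what a connected dominating set on a weighted match graph looks like, the sketch below runs a generic greedy heuristic. It is not the paper's WCDS algorithm, whose details are not given in the abstract: the graph, the weights, the selection rule, and the function names are all assumptions, and edge weights are assumed to be positive on a connected graph.

```python
def build_graph(edges):
    """Undirected weighted graph as {node: {neighbor: weight}} from (u, v, w) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph

def greedy_wcds(graph):
    """Greedy connected dominating set, preferring nodes with heavy edges.

    Assumes a connected graph with positive edge weights.
    """
    # Start from the node with the largest total incident weight.
    start = max(graph, key=lambda n: (sum(graph[n].values()), str(n)))
    selected = {start}
    dominated = {start} | set(graph[start])
    while len(dominated) < len(graph):
        # Dominated nodes outside the set touch the set, so adding one keeps it connected.
        candidates = [n for n in dominated if n not in selected]

        def gain(node):
            # Total weight of edges leading to nodes that are not yet dominated.
            return sum(w for nb, w in graph[node].items() if nb not in dominated)

        best = max(candidates, key=lambda n: (gain(n), str(n)))
        selected.add(best)
        dominated |= set(graph[best])
    return selected

# Hypothetical match graph: nodes are images, weights score the feature matches.
edges = [("A", "B", 5), ("B", "C", 3), ("C", "D", 4), ("D", "E", 2), ("B", "F", 1)]
graph = build_graph(edges)
backbone = greedy_wcds(graph)
print(sorted(backbone))
```

Every image outside the selected backbone is adjacent to at least one backbone image, which is the property that lets a reduced global model anchor the separately reconstructed clusters.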

[13]  arXiv:2206.11502 [pdf]
Title: A Review of Published Machine Learning Natural Language Processing Applications for Protocolling Radiology Imaging
Authors: Nihal Raju (5), Michael Woodburn (1 and 5), Stefan Kachel (2 and 3), Jack O'Shaughnessy (5), Laurence Sorace (5), Natalie Yang (2), Ruth P Lim (2 and 4) ((1) Harvard University, Extension School, Cambridge, MA, USA, (2) Department of Radiology, The University of Melbourne, Parkville, (3) Department of Radiology, Columbia University in the City of New York, (4) Department of Surgery, Austin, The University of Melbourne, (5) Austin Hospital, Austin Health, Melbourne, Australia)
Comments: 7 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Machine learning (ML) is a subfield of Artificial intelligence (AI), and its applications in radiology are growing at an ever-accelerating rate. The most studied ML application is the automated interpretation of images. However, natural language processing (NLP), which can be combined with ML for text interpretation tasks, also has many potential applications in radiology. One such application is automation of radiology protocolling, which involves interpreting a clinical radiology referral and selecting the appropriate imaging technique. It is an essential task which ensures that the correct imaging is performed. However, the time that a radiologist must dedicate to protocolling could otherwise be spent reporting, communicating with referrers, or teaching. To date, there have been few publications in which ML models were developed that use clinical text to automate protocol selection. This article reviews the existing literature in this field. A systematic assessment of the published models is performed with reference to best practices suggested by machine learning convention. Progress towards implementing automated protocolling in a clinical setting is discussed.

[14]  arXiv:2206.11520 [pdf, other]
Title: ICOS Protein Expression Segmentation: Can Transformer Networks Give Better Results?
Comments: Accepted MIUA conference (Abstract short paper)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Biomarkers identify a patient's response to treatment. Despite recent advances in artificial intelligence based on Transformer networks, only limited research has measured their performance on challenging histopathology images. In this paper, we investigate the efficacy of numerous state-of-the-art Transformer networks for immune-checkpoint biomarker Inducible T-cell COStimulator (ICOS) protein cell segmentation in colon cancer from immunohistochemistry (IHC) slides. Extensive experimental results confirm that MiSSFormer achieved the highest Dice score, 74.85%, among the evaluated Transformer and Efficient U-Net methods.
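
For readers unfamiliar with the Dice score used to rank the methods, here is a minimal sketch for binary masks. The masks are made up, and returning 1.0 when both masks are empty is a convention chosen here, not something the abstract specifies.

```python
def dice_score(pred, target):
    """Dice coefficient 2 * |P and T| / (|P| + |T|) for flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical flattened segmentation masks (1 = cell pixel, 0 = background).
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
score = dice_score(pred, target)
print(f"Dice = {100 * score:.2f}%")
```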

[15]  arXiv:2206.11541 [pdf, other]
Title: A Neuromorphic Vision-Based Measurement for Robust Relative Localization in Future Space Exploration Missions
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Space exploration has witnessed revolutionary changes with the landing of the Perseverance Rover on the Martian surface and the first flight beyond Earth by the Mars helicopter, Ingenuity. During their mission on Mars, Perseverance Rover and Ingenuity collaboratively explore the Martian surface, where Ingenuity scouts terrain information for the rover's safe traversability. Hence, determining the relative poses between the two platforms is of paramount importance for the success of this mission. Driven by this necessity, this work proposes a robust relative localization system based on a fusion of neuromorphic vision-based measurements (NVBMs) and inertial measurements. The emergence of neuromorphic vision triggered a paradigm shift in the computer vision community, due to its unique working principle delineated with asynchronous events triggered by variations of light intensities occurring in the scene. This implies that observations cannot be acquired in static scenes due to illumination invariance. To circumvent this limitation, high frequency active landmarks are inserted in the scene to guarantee consistent event firing. These landmarks are adopted as salient features to facilitate relative localization. A novel event-based landmark identification algorithm using Gaussian Mixture Models (GMM) is developed for matching landmark correspondences, which form our NVBMs. The NVBMs are fused with inertial measurements in the proposed state estimators, a landmark tracking Kalman filter (LTKF) and a translation decoupled Kalman filter (TDKF), for landmark tracking and relative localization, respectively. The proposed system was tested in a variety of experiments and has outperformed state-of-the-art approaches in accuracy and range.

[16]  arXiv:2206.11589 [pdf, other]
Title: Learning Towards the Largest Margins
Comments: ICLR 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

One of the main challenges for feature representation in deep learning-based classification is the design of appropriate loss functions that exhibit strong discriminative power. The classical softmax loss does not explicitly encourage discriminative learning of features. A popular direction of research is to incorporate margins in well-established losses in order to enforce extra intra-class compactness and inter-class separability, which, however, were developed through heuristic means, as opposed to rigorous mathematical principles. In this work, we attempt to address this limitation by formulating the principled optimization objective as learning towards the largest margins. Specifically, we firstly define the class margin as the measure of inter-class separability, and the sample margin as the measure of intra-class compactness. Accordingly, to encourage discriminative representation of features, the loss function should promote the largest possible margins for both classes and samples. Furthermore, we derive a generalized margin softmax loss to draw general conclusions for the existing margin-based losses. Not only does this principled framework offer new perspectives to understand and interpret existing margin-based losses, but it also provides new insights that can guide the design of new tools, including sample margin regularization and largest margin softmax loss for the class-balanced case, and zero-centroid regularization for the class-imbalanced case. Experimental results demonstrate the effectiveness of our strategy on a variety of tasks, including visual classification, imbalanced classification, person re-identification, and face verification.
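
The idea of incorporating a margin into the softmax loss can be sketched in a few lines. The example below uses a generic additive margin subtracted from the target-class logit; it is an assumption chosen for illustration, not the generalized margin softmax loss or the largest margin softmax loss derived in the paper, and the logits are invented.

```python
import math

def margin_softmax_loss(logits, target, margin=0.0):
    """Cross-entropy after subtracting `margin` from the target-class logit."""
    adjusted = [z - margin if i == target else z for i, z in enumerate(logits)]
    m = max(adjusted)
    log_norm = m + math.log(sum(math.exp(z - m) for z in adjusted))
    return log_norm - adjusted[target]

# Hypothetical logits for a 3-class problem; class 0 is the true class.
logits = [2.0, 0.5, -1.0]
plain = margin_softmax_loss(logits, target=0)
with_margin = margin_softmax_loss(logits, target=0, margin=0.5)
print(f"softmax loss = {plain:.4f}, with margin 0.5 = {with_margin:.4f}")
```

Because the margin handicaps the correct class, the same logits incur a larger loss, so the network has to separate the classes by more than the margin to reach the same loss value.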

[17]  arXiv:2206.11610 [pdf, other]
Title: 1st Place Solutions for RxR-Habitat Vision-and-Language Navigation Competition (CVPR 2022)
Comments: Winner of the 2nd RxR-Habitat Competition @ CVPR2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

This report presents the methods of the winning entry of the RxR-Habitat Competition in CVPR 2022. The competition addresses the problem of Vision-and-Language Navigation in Continuous Environments (VLN-CE), which requires an agent to follow step-by-step natural language instructions to reach a target. We present a modular plan-and-control approach for the task. Our model consists of three modules: the candidate waypoints predictor (CWP), the history enhanced planner and the tryout controller. In each decision loop, CWP first predicts a set of candidate waypoints based on depth observations from multiple views. It can reduce the complexity of the action space and facilitate planning. Then, a history-enhanced planner is adopted to select one of the candidate waypoints as the subgoal. The planner additionally encodes historical memory to track the navigation progress, which is especially effective for long-horizon navigation. Finally, we propose a non-parametric heuristic controller named tryout to execute low-level actions to reach the planned subgoal. It is based on the trial-and-error mechanism which can help the agent to avoid obstacles and escape from getting stuck. All three modules work hierarchically until the agent stops. We further take several recent advances of Vision-and-Language Navigation (VLN) to improve the performance such as pretraining based on large-scale synthetic in-domain dataset, environment-level data augmentation and snapshot model ensemble. Our model won the RxR-Habitat Competition 2022, with 48% and 90% relative improvements over existing methods on NDTW and SR metrics respectively.

[18]  arXiv:2206.11629 [pdf, other]
Title: Global Sensing and Measurements Reuse for Image Compressed Sensing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Recently, deep network-based image compressed sensing methods have achieved high reconstruction quality and reduced computational overhead compared with traditional methods. However, existing methods obtain measurements only from partial features in the network and use them only once for image reconstruction. They ignore that there are low-, mid-, and high-level features in the network (Zeiler and Fergus, 2014) and that all of them are essential for high-quality reconstruction. Moreover, using measurements only once may not be enough for extracting richer information from them. To address these issues, we propose a novel Measurements Reuse Convolutional Compressed Sensing Network (MR-CCSNet), which employs a Global Sensing Module (GSM) to collect features from all levels for efficient sensing and a Measurements Reuse Block (MRB) to reuse measurements multiple times at multiple scales. Finally, experimental results on three benchmark datasets show that our model can significantly outperform state-of-the-art methods.

[19]  arXiv:2206.11653 [pdf, other]
Title: Learning To Generate Scene Graph from Head to Tail
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Scene Graph Generation (SGG) represents objects and their interactions with a graph structure. Recently, many works have been devoted to solving the imbalance problem in SGG. However, by underestimating the head predicates throughout the training process, they degrade the features of head predicates that provide general features for tail ones. Besides, assigning excessive attention to the tail predicates leads to semantic deviation. Based on this, we propose a novel SGG framework, learning to generate scene graphs from Head to Tail (SGG-HT), containing a Curriculum Re-weight Mechanism (CRM) and a Semantic Context Module (SCM). CRM first learns head/easy samples for robust features of head predicates and then gradually focuses on tail/hard ones. SCM is proposed to relieve semantic deviation by ensuring the semantic consistency between the generated scene graph and the ground truth in global and local representations. Experiments show that SGG-HT significantly alleviates the bias problem and achieves state-of-the-art performance on Visual Genome.

[20]  arXiv:2206.11657 [pdf, other]
Title: Warped Convolution Networks for Homography Estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Homography transformation has an essential relationship with the special linear group and its embedded Lie algebra structure. Although the Lie algebra representation is elegant, few researchers have established the connection between homography estimation and algebra expression. In this paper, we propose Warped Convolution Networks (WCN) to effectively estimate the homography transformation by the SL(3) group and sl(3) algebra with group convolution. To this end, six commutative subgroups within the SL(3) group are composed to form a homography transformation. For each subgroup, a warping function is proposed to bridge the Lie algebra structure to its corresponding parameters in homography. By taking advantage of the warped convolution, homography estimation is formulated as several simple pseudo-translation regressions. By walking along the Lie topology, our proposed WCN is able to learn features that are invariant to homography transformation. It can be easily plugged into other popular CNN-based methods. Extensive experiments on the POT benchmark and the MNIST-Proj dataset show that our proposed method is effective for both homography estimation and classification.
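
The link between homographies and SL(3) rests on the fact that a homography matrix is only defined up to scale, so any non-singular 3x3 matrix can be rescaled to have determinant 1 without changing the mapping it induces on points. The sketch below checks this on a made-up matrix; it does not implement the warped convolution or the subgroup decomposition, and the function names are hypothetical.

```python
import math

def det3(h):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, i, j) = h
    return a * (e * j - f * i) - b * (d * j - f * g) + c * (d * i - e * g)

def normalize_to_sl3(h):
    """Rescale a non-singular 3x3 matrix so that its determinant is 1."""
    d = det3(h)
    if d == 0:
        raise ValueError("matrix is singular")
    scale = math.copysign(abs(d) ** (1.0 / 3.0), d)  # real cube root of the determinant
    return [[v / scale for v in row] for row in h]

def apply_homography(h, point):
    """Map a 2D point through h using homogeneous coordinates."""
    x, y = point
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (u / w, v / w)

# Hypothetical homography: uniform scaling by 2 plus a translation.
H = [[2.0, 0.0, 1.0],
     [0.0, 2.0, -1.0],
     [0.0, 0.0, 1.0]]
H_sl3 = normalize_to_sl3(H)
print(det3(H), det3(H_sl3))
print(apply_homography(H, (1.0, 1.0)), apply_homography(H_sl3, (1.0, 1.0)))
```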

[21]  arXiv:2206.11678 [pdf, other]
Title: BlazePose GHUM Holistic: Real-time 3D Human Landmarks and Pose Estimation
Comments: 4 pages, 4 figures; CVPR Workshop on Computer Vision for Augmented and Virtual Reality, New Orleans, LA, 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We present BlazePose GHUM Holistic, a lightweight neural network pipeline for 3D human body landmarks and pose estimation, specifically tailored to real-time on-device inference. BlazePose GHUM Holistic enables motion capture from a single RGB image including avatar control, fitness tracking and AR/VR effects. Our main contributions include i) a novel method for 3D ground truth data acquisition, ii) updated 3D body tracking with additional hand landmarks and iii) full body pose estimation from a monocular image.

[22]  arXiv:2206.11695 [pdf, other]
Title: NTIRE 2022 Challenge on Perceptual Image Quality Assessment
Comments: This report has been published in CVPR 2022 NTIRE workshop. arXiv admin note: text overlap with arXiv:2105.03072
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

This paper reports on the NTIRE 2022 challenge on perceptual image quality assessment (IQA), held in conjunction with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2022. This challenge is held to address the emerging challenge of IQA by perceptual image processing algorithms. The output images of these algorithms have completely different characteristics from traditional distortions and are included in the PIPAL dataset used in this challenge. This challenge is divided into two tracks, a full-reference IQA track similar to the previous NTIRE IQA challenge and a new track that focuses on the no-reference IQA methods. The challenge has 192 and 179 registered participants for the two tracks, respectively. In the final testing stage, 7 and 8 participating teams submitted their models and fact sheets. Almost all of them have achieved better results than existing IQA methods, and the winning method demonstrates state-of-the-art performance.

[23]  arXiv:2206.11723 [pdf, other]
Title: Self-Supervised Training with Autoencoders for Visual Anomaly Detection
Authors: Alexander Bauer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Deep convolutional autoencoders provide an effective tool for learning non-linear dimensionality reduction in an unsupervised way. Recently, they have been used for the task of anomaly detection in the visual domain. By optimising for the reconstruction error using anomaly-free examples, the common belief is that a trained network will have difficulty reconstructing anomalous parts during the test phase. This is usually done by controlling the capacity of the network by either reducing the size of the bottleneck layer or enforcing sparsity constraints on its activations. However, neither of these techniques explicitly penalises reconstruction of anomalous signals, often resulting in poor detection. We tackle this problem by adapting a self-supervised learning regime which allows the use of discriminative information during training while regularising the model to focus on the data manifold by means of a modified reconstruction error, resulting in accurate detection. Unlike related approaches, the inference of the proposed method during training and prediction is very efficient, processing the whole input image in one single step. Our experiments on the MVTec Anomaly Detection dataset demonstrate high recognition and localisation performance of the proposed method. On the texture-subset, in particular, our approach consistently outperforms several recent anomaly detection methods by a large margin.

[24]  arXiv:2206.11736 [pdf, other]
Title: NovelCraft: A Dataset for Novelty Detection and Discovery in Open Worlds
Authors: Patrick Feeney (1), Sarah Schneider (1 and 2), Panagiotis Lymperopoulos (1), Liping Liu (1), Matthias Scheutz (1), Michael C. Hughes (1) ((1) Dept. of Computer Science, Tufts University, (2) Center for Vision, Automation and Control, Austrian Institute of Technology)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

In order for artificial agents to perform useful tasks in changing environments, they must be able to both detect and adapt to novelty. However, visual novelty detection research often only evaluates on repurposed datasets such as CIFAR-10 originally intended for object classification. This practice restricts novelties to well-framed images of distinct object types. We suggest that new benchmarks are needed to represent the challenges of navigating an open world. Our new NovelCraft dataset contains multi-modal episodic data of the images and symbolic world-states seen by an agent completing a pogo-stick assembly task within a video game world. In some episodes, we insert novel objects that can impact gameplay. Novelty can vary in size, position, and occlusion within complex scenes. We benchmark state-of-the-art novelty detection and generalized category discovery models with a focus on comprehensive evaluation. Results suggest an opportunity for future research: models aware of task-specific costs of different types of mistakes could more effectively detect and adapt to novelty in open worlds.

[25]  arXiv:2206.11739 [pdf, other]
Title: Evidence fusion with contextual discounting for multi-modality medical image segmentation
Comments: MICCAI2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

As information sources are usually imperfect, it is necessary to take into account their reliability in multi-source information fusion tasks. In this paper, we propose a new deep framework allowing us to merge multi-MR image segmentation results using the formalism of Dempster-Shafer theory while taking into account the reliability of different modalities relative to different classes. The framework is composed of an encoder-decoder feature extraction module, an evidential segmentation module that computes a belief function at each voxel for each modality, and a multi-modality evidence fusion module, which assigns a vector of discount rates to each modality evidence and combines the discounted evidence using Dempster's rule. The whole framework is trained by minimizing a new loss function based on a discounted Dice index to increase segmentation accuracy and reliability. The method was evaluated on the BraTS 2021 database of 1251 patients with brain tumors. Quantitative and qualitative results show that our method outperforms the state of the art, and implements an effective new idea for merging multi-information within deep neural networks.
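To make the fusion step above concrete, here is a minimal Python sketch of discounting followed by Dempster's rule for two sources. It is an illustration under simplifying assumptions, not the paper's method: it uses classical Shafer discounting with a single scalar rate per source rather than the per-class vector of discount rates described in the abstract, and the class names, mass values, and function names (`discount`, `dempster_combine`) are hypothetical.

```python
from itertools import product


def discount(mass, alpha, frame):
    """Classical Shafer discounting (assumed simplification): scale every
    mass by (1 - alpha) and transfer the removed mass alpha to the frame."""
    out = {focal: (1.0 - alpha) * m for focal, m in mass.items()}
    out[frame] = out.get(frame, 0.0) + alpha
    return out


def dempster_combine(m1, m2):
    """Dempster's rule: multiply masses of intersecting focal sets and
    renormalise by the non-conflicting mass."""
    combined = {}
    conflict = 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {focal: m / (1.0 - conflict) for focal, m in combined.items()}


# Hypothetical two-class frame and per-voxel masses from two modalities.
frame = frozenset({"tumor", "background"})
tumor = frozenset({"tumor"})
background = frozenset({"background"})

m_a = {tumor: 0.8, frame: 0.2}
m_b = {background: 0.6, frame: 0.4}

# Treat the second modality as less reliable (discount rate 0.5).
m_b_discounted = discount(m_b, 0.5, frame)
fused = dempster_combine(m_a, m_b_discounted)
```

With these made-up numbers, discounting the second source weakens its support for `background`, so the fused mass still favours `tumor` while keeping some mass on the whole frame to express remaining uncertainty.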

[26]  arXiv:2206.11752 [pdf, other]
Title: PromptPose: Language Prompt Helps Animal Pose Estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recently, animal pose estimation has attracted increasing interest from academia (e.g., wildlife and conservation biology) focusing on animal behavior understanding. However, currently animal pose estimation suffers from small datasets and large data variances, making it difficult to obtain robust performance. To tackle this problem, we propose that the rich knowledge about relations between pose-related semantics learned by language models can be utilized to improve animal pose estimation. Therefore, in this study, we introduce a novel PromptPose framework to effectively apply language models for better understanding animal poses based on prompt training. In PromptPose, we propose that adapting the language knowledge to the visual animal poses is key to achieving effective animal pose estimation. To this end, we first introduce textual prompts to build connections between textual semantic descriptions and supporting animal keypoint features. Moreover, we further devise a pixel-level contrastive loss to build dense connections between textual descriptions and local image features, as well as a semantic-level contrastive loss to bridge the gap between global contrasts in language-image cross-modal pre-training and local contrasts in dense prediction. In practice, PromptPose has shown great benefits for improving animal pose estimation. By conducting extensive experiments, we show that our PromptPose achieves superior performance under both supervised and few-shot settings, outperforming representative methods by a large margin. The source code and models will be made publicly available.

[27]  arXiv:2206.11759 [pdf, other]
Title: What makes you, you? Analyzing Recognition by Swapping Face Parts
Comments: Accepted for publication at 26th International Conference on Pattern Recognition (ICPR), 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Deep learning advanced face recognition to an unprecedented accuracy. However, understanding how local parts of the face affect the overall recognition performance is still mostly unclear. Among others, face swap has been explored to this end, but just for the entire face. In this paper, we propose to swap facial parts as a way to disentangle the recognition relevance of different face parts, like eyes, nose and mouth. In our method, swapping parts from a source face to a target one is performed by fitting a 3D prior, which establishes dense pixel correspondence between parts, while also handling pose differences. Seamless cloning is then used to obtain smooth transitions between the mapped source regions and the shape and skin tone of the target face. We devised an experimental protocol that allowed us to draw some preliminary conclusions when the swapped images are classified by deep networks, indicating a prominence of the eyes and eyebrows region. Code available at https://github.com/clferrari/FacePartsSwap

[28]  arXiv:2206.11768 [pdf, other]
Title: FitGAN: Fit- and Shape-Realistic Generative Adversarial Networks for Fashion
Comments: 26th International Conference on Pattern Recognition (ICPR) 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Amidst the rapid growth of fashion e-commerce, remote fitting of fashion articles remains a complex and challenging problem and a main driver of customers' frustration. Despite the recent advances in 3D virtual try-on solutions, such approaches still remain limited to a very narrow - if not only a handful - selection of articles, and often for only one size of those fashion items. Other state-of-the-art approaches that aim to help customers find what fits them online mostly require a high level of customer engagement and privacy-sensitive data (such as height, weight, age, gender, belly shape, etc.), or alternatively need images of customers' bodies in tight clothing. They also often lack the ability to produce fit- and shape-aware visual guidance at scale, coming up short by simply advising which size to order that would best match a customer's physical body attributes, without providing any information on how the garment may fit and look. Contributing towards taking a leap forward and surpassing the limitations of current approaches, we present FitGAN, a generative adversarial model that explicitly accounts for garments' entangled size and fit characteristics of online fashion at scale. Conditioned on the fit and shape of the articles, our model learns disentangled item representations and generates realistic images reflecting the true fit and shape properties of fashion articles. Through experiments on real world data at scale, we demonstrate how our approach is capable of synthesizing visually realistic and diverse fits of fashion items and explore its ability to control fit and shape of images for thousands of online garments.

[29]  arXiv:2206.11804 [pdf, other]
Title: Rethinking Surgical Instrument Segmentation: A Background Image Can Be All You Need
Comments: 10 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Data diversity and volume are crucial to the success of training deep learning models, while in the medical imaging field, the difficulty and cost of data collection and annotation are especially huge. Specifically in robotic surgery, data scarcity and imbalance have heavily affected the model accuracy and limited the design and deployment of deep learning-based surgical applications such as surgical instrument segmentation. Considering this, in this paper, we rethink the surgical instrument segmentation task and propose a one-to-many data generation solution that gets rid of the complicated and expensive process of data collection and annotation from robotic surgery. In our method, we only utilize a single surgical background tissue image and a few open-source instrument images as the seed images and apply multiple augmentations and blending techniques to synthesize amounts of image variations. In addition, we also introduce the chained augmentation mixing during training to further enhance the data diversities. The proposed approach is evaluated on the real datasets of the EndoVis-2018 and EndoVis-2017 surgical scene segmentation. Our empirical analysis suggests that without the high cost of data collection and annotation, we can achieve decent surgical instrument segmentation performance. Moreover, we also observe that our method can deal with novel instrument prediction in the deployment domain. We hope our inspiring results would encourage researchers to emphasize data-centric methods to overcome demanding deep learning limitations besides data shortage, such as class imbalance, domain adaptation, and incremental learning.

[30]  arXiv:2206.11808 [pdf, other]
Title: Unseen Object 6D Pose Estimation: A Benchmark and Baselines
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Estimating the 6D pose for unseen objects is in great demand for many real-world applications. However, current state-of-the-art pose estimation methods can only handle objects that are previously trained. In this paper, we propose a new task that enables and facilitates algorithms to estimate the 6D pose of novel objects during testing. We collect a dataset with both real and synthetic images and up to 48 unseen objects in the test set. Meanwhile, we propose a new metric named Infimum ADD (IADD), which is an invariant measurement for objects with different types of pose ambiguity. A two-stage baseline solution for this task is also provided. By training an end-to-end 3D correspondences network, our method finds corresponding points between an unseen object and a partial view RGBD image accurately and efficiently. It then calculates the 6D pose from the correspondences using an algorithm robust to object symmetry. Extensive experiments show that our method outperforms several intuitive baselines, verifying its effectiveness. All the data, code and models will be made publicly available. Project page: www.graspnet.net/unseen6d

[31]  arXiv:2206.11825 [pdf, other]
Title: YOLOSA: Object detection based on 2D local feature superimposed self-attention
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

We analyzed the network structure of real-time object detection models and found that the features in the feature concatenation stage are very rich. Applying an attention module here can effectively improve the detection accuracy of the model. However, the commonly used attention module or self-attention module shows poor performance in detection accuracy and inference efficiency. Therefore, we propose a novel self-attention module, called 2D local feature superimposed self-attention, for the feature concatenation stage of the neck network. This self-attention module reflects global features through local features and local receptive fields. We also propose and optimize an efficient decoupled head and AB-OTA, and achieve SOTA results. Average precisions of 49.0% (66.2 FPS), 46.1% (80.6 FPS), and 39.1% (100 FPS) were obtained for large, medium, and small-scale models built using our proposed improvements. Our models exceeded YOLOv5 by 0.8% to 3.1% in average precision.

[32]  arXiv:2206.11826 [pdf, other]
Title: Toward Clinically Assisted Colorectal Polyp Recognition via Structured Cross-modal Representation Consistency
Comments: Early Accepted by MICCAI 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Colorectal polyp classification is a critical clinical examination. To improve the classification accuracy, most computer-aided diagnosis algorithms recognize colorectal polyps by adopting Narrow-Band Imaging (NBI). However, NBI is often underused in real clinical scenarios, since acquiring these images requires manually switching the light mode once polyps have been detected using White-Light (WL) images. To avoid the above situation, we propose a novel method to directly achieve accurate white-light colonoscopy image classification by conducting structured cross-modal representation consistency. In practice, a pair of multi-modal images, i.e. NBI and WL, are fed into a shared Transformer to extract hierarchical feature representations. Then a newly designed Spatial Attention Module (SAM) is adopted to calculate the similarities between the class token and patch tokens for a specific modality image. By aligning the class tokens and spatial attention maps of paired NBI and WL images at different levels, the Transformer achieves the ability to keep both global and local representation consistency for the above two modalities. Extensive experimental results illustrate that the proposed method outperforms recent studies by a margin, realizing multi-modal prediction with a single Transformer while greatly improving the classification accuracy when using only WL images.

[33]  arXiv:2206.11892 [pdf, other]
Title: Remote Sensing Change Detection (Segmentation) using Denoising Diffusion Probabilistic Models
Comments: Code available at: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Human civilization has an increasingly powerful influence on the earth system, and earth observations are an invaluable tool for assessing and mitigating the negative impacts. To this end, observing precisely defined changes on Earth's surface is essential, and we propose an effective way to achieve this goal. Notably, our change detection (CD)/segmentation method proposes a novel way to incorporate the millions of off-the-shelf, unlabeled, remote sensing images available through different earth observation programs into the training process through denoising diffusion probabilistic models. We first leverage the information from these off-the-shelf, uncurated, and unlabeled remote sensing images by using a pre-trained denoising diffusion probabilistic model and then employ the multi-scale feature representations from the diffusion model decoder to train a lightweight CD classifier to detect precise changes. The experiments performed on four publicly available CD datasets show that the proposed approach achieves remarkably better results than the state-of-the-art methods in F1, IoU, and overall accuracy. Code and pre-trained models are available at: https://github.com/wgcban/ddpm-cd
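For readers unfamiliar with the three metrics named above, the following Python sketch computes F1, IoU, and overall accuracy for a binary change mask. It uses the standard definitions based on true/false positives and negatives; the function name `change_metrics` and the toy masks are hypothetical, and the exact evaluation protocol of the paper (for example, averaging across images or classes) is not specified here.

```python
def change_metrics(pred, target):
    """Return (F1, IoU, overall accuracy) for flat binary masks,
    where 1 marks a changed pixel and 0 an unchanged pixel."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    # Guard against empty denominators when no change is present or predicted.
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    accuracy = (tp + tn) / len(pred)
    return f1, iou, accuracy


# Toy example: 5 pixels, 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
f1, iou, accuracy = change_metrics(pred, target)
```

Because changed pixels are usually rare, overall accuracy can look high even for poor detectors, which is one reason F1 and IoU on the change class are commonly reported alongside it.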

[34]  arXiv:2206.11894 [pdf, other]
Title: MaskViT: Masked Visual Pre-Training for Video Prediction
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution videos (256x256). Further, we demonstrate the benefits of inference speedup (up to 512x) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.

[35]  arXiv:2206.11895 [pdf, other]
Title: Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space
Comments: Pre-print. 20 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Humans are remarkably flexible in understanding viewpoint changes due to the visual cortex supporting the perception of 3D structure. In contrast, most of the computer vision models that learn visual representation from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, the vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, neither these Transformers nor 2D convolutional networks perform explicit operations to learn viewpoint-agnostic representation for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix to impose geometric transformations on the tokens. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks including image classification, multi-view video alignment, and action recognition. The models with 3DTRL outperform their backbone Transformers in all the tasks with minimal added computation. Our project page is at https://www3.cs.stonybrook.edu/~jishang/3dtrl/3dtrl.html

[36]  arXiv:2206.11896 [pdf, other]
Title: EventNeRF: Neural Radiance Fields from a Single Colour Event Camera
Comments: 14 pages, 10 figures, 2 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Learning coordinate-based volumetric 3D scene representations such as neural radiance fields (NeRF) has been so far studied assuming RGB or RGB-D images as inputs. At the same time, it is known from the neuroscience literature that the human visual system (HVS) is tailored to process asynchronous brightness changes rather than synchronous RGB images, in order to build and continuously update mental 3D representations of the surroundings for navigation and survival. Visual sensors that were inspired by HVS principles are event cameras. Thus, events are sparse and asynchronous per-pixel brightness (or colour channel) change signals. In contrast to existing works on neural 3D scene representation learning, this paper approaches the problem from a new perspective. We demonstrate that it is possible to learn NeRF suitable for novel-view synthesis in the RGB space from asynchronous event streams. Our models achieve high visual accuracy of the rendered novel views of challenging scenes in the RGB space, even though they are trained with substantially less data (i.e., event streams from a single event camera moving around the object) and more efficiently (due to the inherent sparsity of event streams) than the existing NeRF models trained with RGB images. We will release our datasets and the source code, see https://4dqv.mpi-inf.mpg.de/EventNeRF/.

Cross-lists for Fri, 24 Jun 22

[37]  arXiv:2206.11260 (cross-list from cs.SD) [pdf, other]
Title: Few-shot Long-Tailed Bird Audio Recognition
Comments: BirdCLEF 202. Code and models at this https URL
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

It is easier to hear birds than see them. However, they still play an essential role in nature and are excellent indicators of deteriorating environmental quality and pollution. Recent advances in Machine Learning and Convolutional Neural Networks allow us to process continuous audio data to detect and classify bird sounds. This technology can assist researchers in monitoring bird populations' status and trends and ecosystems' biodiversity.
We propose a sound detection and classification pipeline to analyze complex soundscape recordings and identify birdcalls in the background. Our method learns from weak labels and few data and acoustically recognizes the bird species. Our solution achieved 18th place of 807 teams at the BirdCLEF 2022 Challenge hosted on Kaggle.

[38]  arXiv:2206.11376 (cross-list from cs.RO) [pdf, other]
Title: Real-Time Online Skeleton Extraction and Gesture Recognition on Pepper
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

We present a multi-stage pipeline for simple gesture recognition. The novelty of our approach is the association of different technologies, resulting in the first real-time system as of now to conjointly extract skeletons and recognise gestures on a Pepper robot. For this task, Pepper has been augmented with an embedded GPU for running deep CNNs and a fish-eye camera to capture whole scene interaction. We show in this article that real-case scenarios are challenging, and the state-of-the-art approaches hardly deal with unknown human gestures. We present here a way to handle such cases.

[39]  arXiv:2206.11458 (cross-list from eess.IV) [pdf, other]
Title: Weighted Concordance Index Loss-based Multimodal Survival Modeling for Radiation Encephalopathy Assessment in Nasopharyngeal Carcinoma Radiotherapy
Comments: 11 pages, 3 figures, MICCAI2022
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Radiation encephalopathy (REP) is the most common complication for nasopharyngeal carcinoma (NPC) radiotherapy. It is highly desirable to assist clinicians in optimizing the NPC radiotherapy regimen to reduce radiotherapy-induced temporal lobe injury (RTLI) according to the probability of REP onset. To the best of our knowledge, this is the first exploration of predicting radiotherapy-induced REP by jointly exploiting image and non-image data in NPC radiotherapy regimen. We cast REP prediction as a survival analysis task and evaluate the predictive accuracy in terms of the concordance index (CI). We design a deep multimodal survival network (MSN) with two feature extractors to learn discriminative features from multimodal data. One feature extractor imposes feature selection on non-image data, and the other learns visual features from images. Because the prior balanced CI (BCI) loss function, which directly maximizes the CI, is sensitive to uneven sampling per batch, we propose a novel weighted CI (WCI) loss function to leverage all REP samples effectively by assigning them different weights with a dual average operation. We further introduce a temperature hyper-parameter for our WCI to sharpen the risk difference of sample pairs to help model convergence. We extensively evaluate our WCI on a private dataset to demonstrate its favourability against its counterparts. The experimental results also show multimodal data of NPC radiotherapy can bring more gains for REP risk prediction.
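As background for the evaluation metric mentioned above, the sketch below computes the standard (Harrell-style) concordance index in plain Python. It is only the evaluation metric, not the WCI or BCI loss from the paper, whose weighting and temperature details are not given in the abstract; the function name `concordance_index` and the toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs that the predicted risks order correctly.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] is truthy) and a strictly shorter time than subject j.
    The pair is concordant when the earlier-event subject has the higher
    predicted risk; tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: follow-up times, event indicators (0 = censored), predicted risks.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
ci = concordance_index(times, events, risks)
```

In this toy example five pairs are comparable and four are ordered correctly, so the index is 0.8; a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.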

[40]  arXiv:2206.11461 (cross-list from cs.GR) [pdf, other]
Title: Towards Better User Studies in Computer Graphics and Vision
Comments: 15 pages of text, 5 pages of references, 2 figures
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Online crowdsourcing platforms make it easy to perform evaluations of algorithm outputs with surveys that ask questions like "which image is better, A or B?". The proliferation of these "user studies" in vision and graphics research papers has led to an increase of hastily conducted studies that are sloppy and uninformative at best, and potentially harmful and misleading at worst. We argue that more attention needs to be paid to both the design and reporting of user studies in computer vision and graphics papers. In an attempt to improve practitioners' knowledge and increase the trustworthiness and replicability of user studies, we provide an overview of methodologies from user experience research (UXR), human-computer interaction (HCI), and related fields. We discuss foundational user research methods (e.g., needfinding) that are presently underutilized in computer vision and graphics research, but can provide valuable guidance for research projects. We provide further pointers to the literature for readers interested in exploring other UXR methodologies. Finally, we describe broader open issues and recommendations for the research community. We encourage authors and reviewers alike to recognize that not every research contribution requires a user study, and that having no study at all is better than having a carelessly conducted one.

[41]  arXiv:2206.11481 (cross-list from cs.CG) [pdf]
Title: A Novel Algorithm for Exact Concave Hull Extraction
Subjects: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)

Region extraction is necessary in a wide range of applications, from object detection in autonomous driving to analysis of subcellular morphology in cell biology. There exist two main approaches: convex hull extraction, for which exact and efficient algorithms exist, and concave hull extraction, which is better at capturing real-world shapes but does not have a single solution. Especially in the context of a uniform grid, concave hull algorithms are largely approximate, sacrificing region integrity for spatial and temporal efficiency. In this study, we present a novel algorithm that can provide vertex-minimized concave hulls with maximal (i.e. pixel-perfect) resolution and is tunable for speed-efficiency tradeoffs. Our method provides advantages in multiple downstream applications including data compression, retrieval, visualization, and analysis. To demonstrate the practical utility of our approach, we focus on image compression. We demonstrate significant improvements through context-dependent compression on disparate regions within a single image (entropy encoding for noisy and predictive encoding for the structured regions). We show that these improvements range from biomedical images to natural images. Beyond image compression, our algorithm can be applied more broadly to aid in a wide range of practical applications for data retrieval, visualization, and analysis.

[42]  arXiv:2206.11488 (cross-list from cs.LG) [pdf, other]
Title: On Pre-Training for Federated Learning
Comments: Preprint
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

In most of the literature on federated learning (FL), neural networks are initialized with random weights. In this paper, we present an empirical study on the effect of pre-training on FL. Specifically, we aim to investigate if pre-training can alleviate the drastic accuracy drop when clients' decentralized data are non-IID. We focus on FedAvg, the fundamental and most widely used FL algorithm. We found that pre-training does largely close the gap between FedAvg and centralized learning under non-IID data, but this does not come from alleviating the well-known model drifting problem in FedAvg's local training. Instead, how pre-training helps FedAvg is by making FedAvg's global aggregation more stable. When pre-training using real data is not feasible for FL, we propose a novel approach to pre-train with synthetic data. On various image datasets (including one for segmentation), our approach with synthetic pre-training leads to a notable gain, essentially a critical step toward scaling up federated learning for real-world applications.
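The global aggregation step that the abstract refers to can be sketched as follows. This is a minimal Python illustration of FedAvg's server-side averaging, in which client parameters are averaged with weights proportional to each client's number of training samples. Model parameters are represented as flat lists of floats for simplicity, and the function name `fedavg_aggregate` and the toy values are hypothetical, not taken from the paper.

```python
def fedavg_aggregate(client_params, client_sizes):
    """Average client parameter vectors, weighting each client by its
    number of local training samples."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    n_params = len(client_params[0])
    if any(len(p) != n_params for p in client_params):
        raise ValueError("all clients must share the same parameter shape")
    return [
        sum(params[k] * size for params, size in zip(client_params, client_sizes)) / total
        for k in range(n_params)
    ]


# Two hypothetical clients after local training; the second holds 3x more data.
client_params = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [1, 3]
global_params = fedavg_aggregate(client_params, client_sizes)
```

In a full training loop, each round would start local training from the aggregated parameters; under the abstract's framing, pre-training only changes the initial parameters fed into the first round.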

[43]  arXiv:2206.11501 (cross-list from eess.IV) [pdf, other]
Title: A novel adversarial learning strategy for medical image classification
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Deep learning (DL) techniques have been extensively utilized for medical image classification. Most DL-based classification networks are generally structured hierarchically and optimized through the minimization of a single loss function measured at the end of the networks. However, such a single loss design could potentially lead to optimization of one specific value of interest but fail to leverage informative features from intermediate layers that might benefit classification performance and reduce the risk of overfitting. Recently, auxiliary convolutional neural networks (AuxCNNs) have been employed on top of traditional classification networks to facilitate the training of intermediate layers to improve classification performance and robustness. In this study, we proposed an adversarial learning-based AuxCNN to support the training of deep neural networks for medical image classification. Two main innovations were adopted in our AuxCNN classification framework. First, the proposed AuxCNN architecture includes an image generator and an image discriminator for extracting more informative image features for medical image classification, motivated by the concept of generative adversarial network (GAN) and its impressive ability in approximating target data distribution. Second, a hybrid loss function is designed to guide the model training by incorporating different objectives of the classification network and AuxCNN to reduce overfitting. Comprehensive experimental studies demonstrated the superior classification performance of the proposed model. The effect of the network-related factors on classification performance was investigated.

[44]  arXiv:2206.11599 (cross-list from eess.IV) [pdf, other]
Title: Universal Learned Image Compression With Low Computational Cost
Comments: 5 pages
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Recently, learned image compression methods have developed rapidly and exhibited excellent rate-distortion performance when compared to traditional standards, such as JPEG, JPEG2000 and BPG. However, learning-based methods suffer from high computational costs, which hinders deployment on devices with limited resources. To this end, we propose shift-addition parallel modules (SAPMs), including SAPM-E for the encoder and SAPM-D for the decoder, to substantially reduce energy consumption. Specifically, they can be used as plug-and-play components to upgrade existing CNN-based architectures, where the shift branch extracts large-grained features and the addition branch learns small-grained features. Furthermore, we thoroughly analyze the probability distribution of latent representations and propose to use Laplace Mixture Likelihoods for more accurate entropy estimation. Experimental results demonstrate that the proposed methods achieve comparable or even better performance on both PSNR and MS-SSIM metrics than their convolutional counterparts, with roughly a 2x reduction in energy consumption.
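
As a rough illustration of what entropy estimation with a Laplace mixture can look like, the sketch below computes the probability mass of an integer-quantized latent value by integrating a mixture of Laplace densities over a unit-width bin, then converts it to an estimated bit cost. This follows the general recipe used with mixture likelihoods in learned compression and is an assumption about the setup, not the paper's implementation; all function names and parameter values are hypothetical.

```python
import math
from typing import Sequence


def laplace_cdf(x: float, mu: float, b: float) -> float:
    """CDF of a Laplace distribution with location mu and scale b > 0."""
    if x < mu:
        return 0.5 * math.exp((x - mu) / b)
    return 1.0 - 0.5 * math.exp(-(x - mu) / b)


def mixture_mass(y: int, weights: Sequence[float],
                 mus: Sequence[float], scales: Sequence[float]) -> float:
    """Probability that a latent rounds to integer y under a Laplace mixture."""
    return sum(
        w * (laplace_cdf(y + 0.5, mu, b) - laplace_cdf(y - 0.5, mu, b))
        for w, mu, b in zip(weights, mus, scales)
    )


def estimated_bits(y: int, weights: Sequence[float],
                   mus: Sequence[float], scales: Sequence[float]) -> float:
    """Estimated code length in bits for symbol y (negative log2 of its mass)."""
    return -math.log2(mixture_mass(y, weights, mus, scales))


# Usage with made-up mixture parameters (weights sum to 1).
weights, mus, scales = [0.3, 0.7], [0.0, 2.0], [1.0, 0.5]
total_mass = sum(mixture_mass(y, weights, mus, scales) for y in range(-60, 61))
bits_for_two = estimated_bits(2, weights, mus, scales)
```

In a trained model the mixture parameters would be predicted per latent element by the entropy model; here they are fixed constants for illustration.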

[45]  arXiv:2206.11602 (cross-list from cs.LG) [pdf, other]
Title: Prototype-Anchored Learning for Learning with Imperfect Annotations
Comments: ICML 2022
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

The success of deep neural networks relies heavily on the availability of large amounts of high-quality annotated data, which, however, are difficult or expensive to obtain. The resulting labels may be class-imbalanced, noisy, or human-biased. It is challenging to learn unbiased classification models from imperfectly annotated datasets, on which models usually suffer from overfitting or underfitting. In this work, we thoroughly investigate the popular softmax loss and margin-based loss, and offer a feasible approach to tighten the generalization error bound by maximizing the minimal sample margin. We further derive the optimality condition for this purpose, which indicates how the class prototypes should be anchored. Motivated by theoretical analysis, we propose a simple yet effective method, namely prototype-anchored learning (PAL), which can be easily incorporated into various learning-based classification schemes to handle imperfect annotation. We verify the effectiveness of PAL on class-imbalanced learning and noise-tolerant learning by extensive experiments on synthetic and real-world datasets.
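
To make the notion of a minimal sample margin concrete, here is a small sketch under the usual definition of a classification margin, assumed here rather than taken from the paper: the logit of the true class minus the largest logit among the other classes, minimized over the dataset. The function names are hypothetical.

```python
from typing import Sequence


def sample_margin(logits: Sequence[float], label: int) -> float:
    """True-class logit minus the largest competing logit (assumed definition)."""
    best_other = max(v for j, v in enumerate(logits) if j != label)
    return logits[label] - best_other


def minimal_sample_margin(batch_logits: Sequence[Sequence[float]],
                          labels: Sequence[int]) -> float:
    """Smallest margin over all samples; negative if any sample is misclassified."""
    return min(sample_margin(z, y) for z, y in zip(batch_logits, labels))


# Usage: the first sample is classified correctly, the second is not.
batch_logits = [[2.0, 0.5, -1.0], [0.0, 1.0, 3.0]]
labels = [0, 1]
worst_margin = minimal_sample_margin(batch_logits, labels)
```

A training objective that raises this worst-case value pushes every sample, including rare-class or noisily labeled ones, away from the decision boundary.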

[46]  arXiv:2206.11623 (cross-list from cs.RO) [pdf, other]
Title: Waypoint Generation in Row-based Crops with Deep Learning and Contrastive Clustering
Comments: Accepted at ECML PKDD 2022
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

The development of precision agriculture has gradually introduced automation in the agricultural process to support and rationalize all the activities related to field management. In particular, service robotics plays a predominant role in this evolution by deploying autonomous agents able to navigate in fields while executing different tasks without the need for human intervention, such as monitoring, spraying and harvesting. In this context, global path planning is the first necessary step for every robotic mission and ensures that the navigation is performed efficiently and with complete field coverage. In this paper, we propose a learning-based approach to tackle waypoint generation for planning a navigation path for row-based crops, starting from a top-view map of the region-of-interest. We present a novel methodology for waypoint clustering based on a contrastive loss, able to project the points to a separable latent space. The proposed deep neural network can simultaneously predict the waypoint position and cluster assignment with two specialized heads in a single forward pass. The extensive experimentation on simulated and real-world images demonstrates that the proposed approach effectively solves the waypoint generation problem for both straight and curved row-based crops, overcoming the limitations of previous state-of-the-art methodologies.

[47]  arXiv:2206.11669 (cross-list from physics.ao-ph) [pdf, other]
Title: Short-range forecasts of global precipitation using deep learning-augmented numerical weather prediction
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Precipitation governs Earth's hydroclimate, and its daily spatiotemporal fluctuations have major socioeconomic effects. Advances in numerical weather prediction (NWP) have been measured by the improvement of forecasts for various physical fields such as temperature and pressure; however, large biases exist in precipitation prediction. We augment the output of the well-known NWP model CFSv2 with deep learning to create a hybrid model that improves short-range global precipitation forecasts at 1-, 2-, and 3-day lead times. To hybridise, we address the sphericity of the global data by using a modified DLWP-CS architecture, which transforms all the fields to a cubed-sphere projection. Dynamical model precipitation and surface temperature outputs are fed into a modified DLWP-CS (UNET) to forecast ground truth precipitation. While CFSv2's average bias is +5 to +7 mm/day over land, the multivariate deep learning model decreases it to within -1 to +1 mm/day. Hurricane Katrina in 2005, Hurricane Ivan in 2004, the China floods in 2010, the India floods in 2005, and Myanmar storm Nargis in 2008 are used to confirm the substantial enhancement in skill for the hybrid dynamical-deep learning model. CFSv2 typically shows a moderate to large bias in the spatial pattern and overestimates precipitation at short-range time scales. The proposed deep learning-augmented NWP model can address these biases and vastly improve the spatial pattern and magnitude of predicted precipitation. Deep learning-enhanced CFSv2 reduces mean bias by 8x over important land regions at 1-day lead time compared to CFSv2. The spatio-temporal deep learning system opens pathways to further the precision and accuracy of global short-range precipitation forecasts.

[48]  arXiv:2206.11849 (cross-list from cs.LG) [pdf, other]
Title: Sample Condensation in Online Continual Learning
Comments: Accepted as a conference paper at 2022 International Joint Conference on Neural Networks (IJCNN 2022). Part of 2022 IEEE World Congress on Computational Intelligence (IEEE WCCI 2022)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Online continual learning is a challenging learning scenario in which the model must learn from a non-stationary stream of data where each sample is seen only once. The main challenge is to learn incrementally while avoiding catastrophic forgetting, namely the problem of forgetting previously acquired knowledge while learning from new data. A popular solution in this scenario is to use a small memory to retain old data and rehearse them over time. Unfortunately, due to the limited memory size, the quality of the memory deteriorates over time. In this paper we propose OLCGM, a novel replay-based continual learning strategy that uses knowledge condensation techniques to continuously compress the memory and make better use of its limited size. The sample condensation step compresses old samples instead of removing them as other replay strategies do. The experiments show that, whenever the memory budget is limited compared to the complexity of the data, OLCGM improves the final accuracy compared to state-of-the-art replay strategies.

Replacements for Fri, 24 Jun 22

[49]  arXiv:2003.12739 (replaced) [pdf, other]
Title: Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters
Comments: 13 pages, 6 figures, 6 tables. Appeared in MULA Workshop at CVPR 2022
Journal-ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022, pp. 4610-4620
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
[50]  arXiv:2105.01241 (replaced) [pdf, other]
Title: End-to-end One-shot Human Parsing
Comments: Tech report
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[51]  arXiv:2106.10270 (replaced) [pdf, other]
Title: How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
Comments: Andreas, Alex, Xiaohua and Lucas contributed equally. We release more than 50'000 ViT models trained under diverse settings on various datasets. Available at this https URL, this https URL and this https URL TMLR review at this https URL
Journal-ref: Transactions on Machine Learning Research (05/2022)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[52]  arXiv:2106.13389 (replaced) [pdf, other]
Title: Energy-Based Generative Cooperative Saliency Prediction
Journal-ref: The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI) 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[53]  arXiv:2107.07919 (replaced) [pdf, other]
Title: A Survey on Bias in Visual Datasets
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[54]  arXiv:2109.12581 (replaced) [pdf]
Title: A Stacking Ensemble Approach for Supervised Video Summarization
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[55]  arXiv:2111.10943 (replaced) [pdf, other]
Title: Model-Based Single Image Deep Dehazing
Journal-ref: 2022 IEEE International Conference on Image Processing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[56]  arXiv:2112.13942 (replaced) [pdf, other]
Title: PriFit: Learning to Fit Primitives Improves Few Shot Point Cloud Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[57]  arXiv:2203.02146 (replaced) [pdf, other]
Title: Attention Concatenation Volume for Accurate and Efficient Stereo Matching
Comments: Accepted to CVPR 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[58]  arXiv:2203.09550 (replaced) [pdf]
Title: Multi-similarity based Hyperrelation Network for few-shot segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[59]  arXiv:2204.03296 (replaced) [pdf]
Title: Deep Learning for Real Time Satellite Pose Estimation on Low Power Edge TPU
Comments: Improved literature review; added Figure 2; revised tables for better readability; corrected typos
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[60]  arXiv:2204.03458 (replaced) [pdf, other]
Title: Video Diffusion Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[61]  arXiv:2205.11230 (replaced) [pdf, other]
Title: A Deep Learning Ensemble Framework for Off-Nadir Geocentric Pose Prediction
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[62]  arXiv:2206.03799 (replaced) [pdf, other]
Title: Dyna-DM: Dynamic Object-aware Self-supervised Monocular Depth Maps
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[63]  arXiv:2206.04170 (replaced) [pdf, other]
Title: CASS: Cross Architectural Self-Supervision for Medical Image Analysis
Comments: 15 pages, 4 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[64]  arXiv:2206.06761 (replaced) [pdf, other]
Title: Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO
Comments: 6 pages workshop paper accepted at AdvML Frontiers (ICML 2022)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[65]  arXiv:2206.10225 (replaced) [pdf, other]
Title: Broken News: Making Newspapers Accessible to Print-Impaired
Journal-ref: Extended Abstract at Accessibility, Vision, and Autonomy Meet (CVPR 2022 Workshop)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
[66]  arXiv:2206.10536 (replaced) [pdf, other]
Title: HealNet -- Self-Supervised Acute Wound Heal-Stage Classification
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[67]  arXiv:2206.10698 (replaced) [pdf, other]
Title: TiCo: Transformation Invariance and Covariance Contrast for Self-Supervised Visual Representation Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[68]  arXiv:2206.11215 (replaced) [pdf, other]
Title: Correct and Certify: A New Approach to Self-Supervised 3D-Object Perception
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
[69]  arXiv:2011.06923 (replaced) [pdf, other]
Title: LEAN: graph-based pruning for convolutional neural networks by extracting longest chains
Comments: 10 pages + 2 pages references. Code is publicly available at: this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
[70]  arXiv:2111.05955 (replaced) [pdf, other]
Title: Keys to Accurate Feature Extraction Using Residual Spiking Neural Networks
Comments: 17 pages, 6 figures, 17 tables
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[71]  arXiv:2112.03227 (replaced) [pdf, other]
Title: CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-horizon Robot Manipulation Tasks
Comments: Accepted for publication at IEEE Robotics and Automation Letters (RAL). Code, models and dataset available at this http URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[72]  arXiv:2203.03634 (replaced) [pdf, other]
Title: Remote blood pressure measurement via spatiotemporal mapping of a short-time facial video
Comments: 7 pages, 7 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
[73]  arXiv:2203.04874 (replaced) [pdf, other]
Title: VGQ-CNN: Moving Beyond Fixed Cameras and Top-Grasps for Grasp Quality Prediction
Comments: Accepted for International Joint Conference on Neural Networks (IJCNN) 2022
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
[74]  arXiv:2206.02881 (replaced) [pdf, other]
Title: Mesh-based Dynamics with Occlusion Reasoning for Cloth Manipulation
Comments: RSS 2022. Project website at this https URL
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
[75]  arXiv:2206.05266 (replaced) [pdf, other]
Title: Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[76]  arXiv:2206.06264 (replaced) [pdf, other]
Title: Automatic Polyp Segmentation with Multiple Kernel Dilated Convolution Network
Journal-ref: Published CBMS 2022
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)