Computer Vision and Pattern Recognition

New submissions

New submissions for Fri, 20 May 22

[1]  arXiv:2205.09250 [pdf, other]
Title: Bayesian Convolutional Neural Networks for Limited Data Hyperspectral Remote Sensing Image Classification
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Employing deep neural networks for Hyper-spectral remote sensing (HSRS) image classification is a challenging task. HSRS images have high dimensionality and a large number of channels with substantial redundancy between channels. In addition, the training data for classifying HSRS images is limited and the amount of available training data is much smaller compared to other classification tasks. These factors complicate the training process of deep neural networks with many parameters and cause them to not perform well even compared to conventional models. Moreover, convolutional neural networks produce over-confident predictions, which is highly undesirable considering the aforementioned problem.
In this work, we use a special class of deep neural networks, namely Bayesian neural networks, to classify HSRS images. To the best of our knowledge, this is the first time that this class of neural networks has been used in HSRS image classification. Bayesian neural networks provide an inherent tool for measuring uncertainty. We show that a Bayesian network can outperform a similarly-constructed non-Bayesian convolutional neural network (CNN) and an off-the-shelf Random Forest (RF). Moreover, experimental results for the Pavia Centre, Salinas, and Botswana datasets show that the Bayesian network is more stable and robust to model pruning. Furthermore, we analyze the prediction uncertainty of the Bayesian model and show that the prediction uncertainty metric can provide information about the model predictions and has a positive correlation with the prediction error.
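The abstract does not say which uncertainty metric is used. As a minimal sketch, assuming the common choice of predictive entropy over several stochastic forward passes, the idea can be illustrated as follows. The function name and the sample probabilities are hypothetical and are not taken from the paper.

```python
import math


def predictive_entropy(prob_samples):
    """Entropy (in nats) of the mean class distribution over stochastic forward passes.

    prob_samples: list of probability vectors, one per sampled set of weights.
    """
    n_samples = len(prob_samples)
    n_classes = len(prob_samples[0])
    # Average the class probabilities over the sampled networks.
    mean_probs = [sum(s[c] for s in prob_samples) / n_samples for c in range(n_classes)]
    # Shannon entropy of the averaged distribution; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in mean_probs if p > 0.0)


# Hypothetical softmax outputs from three weight samples for a single pixel.
confident = [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01], [0.99, 0.005, 0.005]]
uncertain = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]

print(predictive_entropy(confident) < predictive_entropy(uncertain))
```

Under this assumption, samples that agree on one class give an entropy near zero, while samples that disagree give a larger value. That is the kind of signal one would expect to correlate with prediction error.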

[2]  arXiv:2205.09256 [pdf, other]
Title: Training Vision-Language Transformers from Captions Alone
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

We show that Vision-Language Transformers can be learned without human labels (e.g. class labels, bounding boxes, etc). Existing work, whether explicitly utilizing bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model Vision-Language from Captions (VLC) built on top of Masked Auto-Encoders that does not require this supervision. In fact, in a head-to-head comparison between ViLT, the current state-of-the-art patch-based vision-language transformer which is pretrained with supervised object classification, and our model, VLC, we find that our approach 1. outperforms ViLT on standard benchmarks, 2. provides more interpretable and intuitive patch visualizations, and 3. is competitive with many larger models that utilize ROIs trained on annotated bounding-boxes.

[3]  arXiv:2205.09292 [pdf, other]
Title: Free Lunch for Surgical Video Understanding by Distilling Self-Supervisions
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Self-supervised learning has witnessed great progress in vision and NLP; recently, it has also attracted much attention in various medical imaging modalities such as X-ray, CT, and MRI. Existing methods mostly focus on building new pretext self-supervision tasks such as reconstruction, orientation, and masking identification according to the properties of medical images. However, the publicly available self-supervision models are not fully exploited. In this paper, we present a powerful yet efficient self-supervision framework for surgical video understanding. Our key insight is to distill knowledge from publicly available models trained on large generic datasets to facilitate the self-supervised learning of surgical videos. To this end, we first introduce a semantic-preserving training scheme to obtain our teacher model, which not only contains semantics from the publicly available models, but also can produce accurate knowledge for surgical data. Besides training with only contrastive learning, we also introduce a distillation objective to transfer the rich learned information from the teacher model to self-supervised learning on surgical data. Extensive experiments on two surgical phase recognition benchmarks show that our framework can significantly improve the performance of existing self-supervised learning methods. Notably, our framework demonstrates a compelling advantage under a low-data regime. Our code is available at https://github.com/xmed-lab/DistillingSelf.

[4]  arXiv:2205.09299 [pdf, other]
Title: 3DConvCaps: 3DUnet with Convolutional Capsule Encoder for Medical Image Segmentation
Comments: Accepted to ICPR 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Convolutional Neural Networks (CNNs) have achieved promising results in medical image segmentation. However, CNNs require lots of training data and are incapable of handling pose and deformation of objects. Furthermore, their pooling layers tend to discard important information such as positions, and CNNs are sensitive to rotation and affine transformation. The capsule network is a recent architecture that has achieved better robustness in part-whole representation learning by replacing pooling layers with dynamic routing and convolutional strides, and it has shown promising results on popular tasks such as digit classification and object segmentation. In this paper, we propose a 3D encoder-decoder network with Convolutional Capsule Encoder (called 3DConvCaps) to learn lower-level features (short-range attention) with convolutional layers while modeling the higher-level features (long-range dependence) with capsule layers. Our experiments on multiple datasets including iSeg-2017, Hippocampus, and Cardiac demonstrate that our 3DConvCaps network considerably outperforms previous capsule networks and 3D-UNets. We further conduct ablation studies of network efficiency and segmentation performance under various configurations of convolution layers and capsule layers at both contracting and expanding paths.

[5]  arXiv:2205.09307 [pdf, other]
Title: Support-set based Multi-modal Representation Enhancement for Video Captioning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Video captioning is a challenging task that necessitates a thorough comprehension of visual scenes. Existing methods follow a typical one-to-one mapping, which concentrates on a limited sample space while ignoring the intrinsic semantic associations between samples, resulting in rigid and uninformative expressions. To address this issue, we propose a novel and flexible framework, namely Support-set based Multi-modal Representation Enhancement (SMRE) model, to mine rich information in a semantic subspace shared between samples. Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements. During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way. Extensive experiments on MSVD and MSR-VTT datasets demonstrate that our SMRE achieves state-of-the-art performance.

[6]  arXiv:2205.09318 [pdf, other]
Title: On Demographic Bias in Fingerprint Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Fingerprint recognition systems have been deployed globally in numerous applications including personal devices, forensics, law enforcement, banking, and national identity systems. For these systems to be socially acceptable and trustworthy, it is critical that they perform equally well across different demographic groups. In this work, we propose a formal statistical framework to test for the existence of bias (demographic differentials) in fingerprint recognition across four major demographic groups (white male, white female, black male, and black female) for two state-of-the-art (SOTA) fingerprint matchers operating in verification and identification modes. Experiments on two different fingerprint databases (with 15,468 and 1,014 subjects) show that demographic differentials in SOTA fingerprint recognition systems decrease as the matcher accuracy increases and any small bias that may be evident is likely due to certain outlier, low-quality fingerprint images.

[7]  arXiv:2205.09343 [pdf, other]
Title: Physically-Based Editing of Indoor Scene Lighting from a Single Image
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We present a method to edit complex indoor lighting from a single image with its predicted depth and light source segmentation masks. This is an extremely challenging problem that requires modeling complex light transport, and disentangling HDR lighting from material and geometry with only a partial LDR observation of the scene. We tackle this problem using two novel components: 1) a holistic scene reconstruction method that estimates scene reflectance and parametric 3D lighting, and 2) a neural rendering framework that re-renders the scene from our predictions. We use physically-based indoor light representations that allow for intuitive editing, and infer both visible and invisible light sources. Our neural rendering framework combines physically-based direct illumination and shadow rendering with deep networks to approximate global illumination. It can capture challenging lighting effects, such as soft shadows, directional lighting, specular materials, and interreflections. Previous single image inverse rendering methods usually entangle scene lighting and geometry and only support applications like object insertion. Instead, by combining parametric 3D lighting estimation with neural scene rendering, we demonstrate the first automatic method to achieve full scene relighting, including light source insertion, removal, and replacement, from a single image. All source code and data will be publicly released.

[8]  arXiv:2205.09351 [pdf, other]
Title: Mip-NeRF RGB-D: Depth Assisted Fast Neural Radiance Fields
Journal-ref: WSCG 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Neural scene representations, such as neural radiance fields (NeRF), are based on training a multilayer perceptron (MLP) using a set of color images with known poses. An increasing number of devices now produce RGB-D information, which has been shown to be very important for a wide range of tasks. Therefore, the aim of this paper is to investigate what improvements can be made to these promising implicit representations by incorporating depth information with the color images. In particular, the recently proposed Mip-NeRF approach, which uses conical frustums instead of rays for volume rendering, allows one to account for the varying area of a pixel with distance from the camera center. The proposed method additionally models depth uncertainty. This makes it possible to address major limitations of NeRF-based approaches by improving the accuracy of geometry, reducing artifacts, and shortening both training and prediction time. Experiments are performed on well-known benchmark scenes, and comparisons show improved accuracy in scene geometry and photometric reconstruction, while reducing the training time by a factor of 3 to 5.

[9]  arXiv:2205.09363 [pdf, other]
Title: Plane Geometry Diagram Parsing
Comments: Accepted to IJCAI 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Geometry diagram parsing plays a key role in geometry problem solving, wherein the primitive extraction and relation parsing remain challenging due to the complex layout and between-primitive relationship. In this paper, we propose a powerful diagram parser based on deep learning and graph reasoning. Specifically, a modified instance segmentation method is proposed to extract geometric primitives, and the graph neural network (GNN) is leveraged to realize relation parsing and primitive classification incorporating geometric features and prior knowledge. All the modules are integrated into an end-to-end model called PGDPNet to perform all the sub-tasks simultaneously. In addition, we build a new large-scale geometry diagram dataset named PGDP5K with primitive level annotations. Experiments on PGDP5K and an existing dataset IMP-Geometry3K show that our model outperforms state-of-the-art methods in four sub-tasks remarkably. Our code, dataset and appendix material are available at https://github.com/mingliangzhang2018/PGDP.

[10]  arXiv:2205.09373 [pdf, other]
Title: Diversity Matters: Fully Exploiting Depth Clues for Reliable Monocular 3D Object Detection
Comments: This paper has been accepted as an oral presentation of CVPR2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

As an inherently ill-posed problem, depth estimation from single images is the most challenging part of monocular 3D object detection (M3OD). Many existing methods rely on preconceived assumptions to bridge the missing spatial information in monocular images, and predict a sole depth value for every object of interest. However, these assumptions do not always hold in practical applications. To tackle this problem, we propose a depth solving system that fully explores the visual clues from the subtasks in M3OD and generates multiple estimations for the depth of each target. Since the depth estimations rely on different assumptions in essence, they present diverse distributions. Even if some assumptions collapse, the estimations established on the remaining assumptions are still reliable. In addition, we develop a depth selection and combination strategy. This strategy is able to remove abnormal estimations caused by collapsed assumptions, and adaptively combine the remaining estimations into a single one. In this way, our depth solving system becomes more precise and robust. Exploiting the clues from multiple subtasks of M3OD and without introducing any extra information, our method surpasses the current best method by more than 20% relatively on the Moderate level of test split in the KITTI 3D object detection benchmark, while still maintaining real-time efficiency.

[11]  arXiv:2205.09383 [pdf, other]
Title: Unconventional Visual Sensors for Autonomous Vehicles
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Autonomous vehicles rely on perception systems to understand their surroundings for further navigation missions. Cameras are essential for perception systems because modern computer vision algorithms give them advantages in object detection and recognition compared to other sensors, such as LiDARs and radars. However, limited by its inherent imaging principle, a standard RGB camera may perform poorly in a variety of adverse scenarios, including but not limited to low illumination, high contrast, and bad weather such as fog, rain, or snow. Meanwhile, estimating 3D information from 2D image detections is generally more difficult than with LiDARs or radars. Several new sensing technologies have emerged in recent years to address the limitations of conventional RGB cameras. In this paper, we review the principles of four novel image sensors: infrared cameras, range-gated cameras, polarization cameras, and event cameras. Their comparative advantages, existing or potential applications, and corresponding data processing algorithms are all presented in a systematic manner. We expect that this study will assist practitioners in the autonomous driving community with new perspectives and insights.

[12]  arXiv:2205.09392 [pdf, other]
Title: UIF: An Objective Quality Assessment for Underwater Image Enhancement
Comments: This paper was submitted to ACMMM 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Due to the complex and volatile lighting environment, underwater imaging can be readily impaired by light scattering, warping, and noise. To improve the visual quality, Underwater Image Enhancement (UIE) techniques have been widely studied. Recent efforts have also been devoted to evaluating and comparing UIE performance with subjective and objective methods. However, subjective evaluation is time-consuming and uneconomical for all images, while existing objective methods have limited capabilities for the newly-developed UIE approaches based on deep learning. To fill this gap, we propose an Underwater Image Fidelity (UIF) metric for objective evaluation of enhanced underwater images. By exploiting the statistical features of these images, we extract naturalness-related, sharpness-related, and structure-related features. Among them, the naturalness-related and sharpness-related features evaluate visual improvement of enhanced images; the structure-related feature indicates structural similarity between images before and after UIE. Then, we employ support vector regression to fuse the above three features into a final UIF metric. In addition, we have also established a large-scale UIE database with subjective scores, namely Underwater Image Enhancement Database (UIED), which is utilized as a benchmark to compare all objective metrics. Experimental results confirm that the proposed UIF outperforms a variety of underwater and general-purpose image quality metrics.

[13]  arXiv:2205.09442 [pdf, other]
Title: Oracle-MNIST: a Realistic Image Dataset for Benchmarking Machine Learning Algorithms
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We introduce the Oracle-MNIST dataset, comprising 28$\times$28 grayscale images of 30,222 ancient characters from 10 categories, for benchmarking pattern classification, with particular challenges on image noise and distortion. The training set consists of 27,222 images, and the test set contains 300 images per class. Oracle-MNIST shares the same data format with the original MNIST dataset, allowing for direct compatibility with all existing classifiers and systems, but it constitutes a more challenging classification task than MNIST. The images of ancient characters suffer from 1) extremely serious and unique noise caused by three thousand years of burial and aging and 2) dramatically varying writing styles of the ancient Chinese, all of which make them realistic for machine learning research. The dataset is freely available at https://github.com/wm-bupt/oracle-mnist.
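Because the abstract states that the data format matches the original MNIST, a reader for the MNIST IDX image layout should in principle apply. The following is a minimal sketch under that assumption. It parses a small synthetic buffer rather than the real files, whose names and compression (MNIST-style files are often gzip-compressed) are not given in the abstract. The function name is hypothetical.

```python
import struct


def parse_idx_images(raw):
    """Parse an MNIST-style IDX3 image buffer into a list of flat, row-major pixel lists."""
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:  # 0x00000803: unsigned-byte data with three dimensions
        raise ValueError(f"unexpected magic number {magic}")
    size = rows * cols
    if len(raw) != 16 + count * size:
        raise ValueError("buffer length does not match header")
    return [list(raw[16 + i * size: 16 + (i + 1) * size]) for i in range(count)]


# Synthetic buffer with two 2x2 images, so the sketch runs without the dataset.
demo = struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 1, 2, 3, 255, 254, 253, 252])
images = parse_idx_images(demo)

# Split sizes as stated in the abstract: 27,222 training images, 300 test images per class.
TRAIN_IMAGES = 27222
TEST_IMAGES = 10 * 300
print(len(images), TRAIN_IMAGES + TEST_IMAGES)
```

The split sizes are consistent with the stated total: 27,222 + 10 × 300 = 30,222.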

[14]  arXiv:2205.09443 [pdf, ps, other]
Title: PYSKL: Towards Good Practices for Skeleton Action Recognition
Comments: Tech Report
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We present PYSKL: an open-source toolbox for skeleton-based action recognition based on PyTorch. The toolbox supports a wide variety of skeleton action recognition algorithms, including approaches based on GCN and CNN. In contrast to existing open-source skeleton action recognition projects that include only one or two algorithms, PYSKL implements six different algorithms under a unified framework with both the latest and original good practices to ease the comparison of efficacy and efficiency. We also provide an original GCN-based skeleton action recognition model named ST-GCN++, which achieves competitive recognition performance without any complicated attention schemes, serving as a strong baseline. Meanwhile, PYSKL supports the training and testing of nine skeleton-based action recognition benchmarks and achieves state-of-the-art recognition performance on eight of them. To facilitate future research on skeleton action recognition, we also provide a large number of trained models and detailed benchmark results to give some insights. PYSKL is released at https://github.com/kennymckormick/pyskl and is actively maintained. We will update this report when we add new features or benchmarks. The current version corresponds to PYSKL v0.2.

[15]  arXiv:2205.09445 [pdf, other]
Title: Cross-Enhancement Transformer for Action Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Temporal convolutions have been the paradigm of choice in action segmentation, enlarging the long-term receptive field by stacking more convolution layers. However, the higher layers lose local information necessary for frame recognition. To solve this problem, this paper proposes a novel encoder-decoder structure called Cross-Enhancement Transformer. Our approach effectively learns temporal structure representations with an interactive self-attention mechanism: the convolutional feature maps of each encoder layer are concatenated with a set of decoder features produced via self-attention. Therefore, local and global information are used simultaneously across a series of frame actions. In addition, a new loss function that penalizes over-segmentation errors is proposed to enhance the training process. Experiments show that our framework achieves state-of-the-art performance on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities and the Breakfast dataset.

[16]  arXiv:2205.09495 [pdf, other]
Title: Learning Feature Fusion for Unsupervised Domain Adaptive Person Re-identification
Authors: Jin Ding, Xue Zhou
Comments: Accepted by ICPR2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Unsupervised domain adaptive (UDA) person re-identification (ReID) has gained increasing attention for its effectiveness on the target domain without manual annotations. Most fine-tuning based UDA person ReID methods focus on encoding global features for pseudo labels generation, neglecting the local feature that can provide for the fine-grained information. To handle this issue, we propose a Learning Feature Fusion (LF2) framework for adaptively learning to fuse global and local features to obtain a more comprehensive fusion feature representation. Specifically, we first pre-train our model within a source domain, then fine-tune the model on unlabeled target domain based on the teacher-student training strategy. The average weighting teacher network is designed to encode global features, while the student network updating at each iteration is responsible for fine-grained local features. By fusing these multi-view features, multi-level clustering is adopted to generate diverse pseudo labels. In particular, a learnable Fusion Module (FM) for giving prominence to fine-grained local information within the global feature is also proposed to avoid obscure learning of multiple pseudo labels. Experiments show that our proposed LF2 framework outperforms the state-of-the-art with 73.5% mAP and 83.7% Rank1 on Market1501 to DukeMTMC-ReID, and achieves 83.2% mAP and 92.8% Rank1 on DukeMTMC-ReID to Market1501.
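The abstract describes an "average weighting teacher network" alongside a student updated at each iteration. A minimal sketch of one common way to realize such a teacher, assuming an exponential moving average of the student weights (a mean-teacher style update, which the abstract does not confirm), is shown below. The parameter names and the momentum value are hypothetical.

```python
def ema_update(teacher, student, momentum=0.999):
    """Return teacher weights moved toward the student by an exponential moving average."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy scalar "weights" standing in for network parameters.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}

# One update after a student training iteration.
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)
```

With a momentum close to 1, the teacher changes slowly and smooths out the iteration-to-iteration noise of the student.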

[17]  arXiv:2205.09518 [pdf, other]
Title: Enhancing the Transferability of Adversarial Examples via a Few Queries
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Due to the vulnerability of deep neural networks, the black-box attack has drawn great attention from the community. Though transferable priors have decreased the number of queries needed by black-box query attacks in recent efforts, the average number of queries is still larger than 100, which makes these attacks easily affected by query-limit policies. In this work, we propose a novel query prior-based method to enhance the family of fast gradient sign methods and improve their attack transferability by using a few queries. Specifically, for the untargeted attack, we find that successful adversarial examples tend to be classified by the victim model as the wrong categories with higher probability. Therefore, the weighted augmented cross-entropy loss is proposed to reduce the gradient angle between the surrogate model and the victim model for enhancing the transferability of the adversarial examples. Theoretical analysis and extensive experiments demonstrate that our method could significantly improve the transferability of gradient-based adversarial attacks on CIFAR10/100 and ImageNet and outperform the black-box query attack with the same few queries.
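For context, the basic fast gradient sign step that this family of methods builds on can be sketched as below. This shows only the standard untargeted step. It is not the query prior-based method or the weighted augmented cross-entropy loss from the paper. The gradient values are hypothetical stand-ins for a surrogate model's loss gradient.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm_step(x, grad, eps, lo=0.0, hi=1.0):
    """One untargeted FGSM step: shift each value by eps along the gradient sign, then clip."""
    return [min(hi, max(lo, xi + eps * sign(gi))) for xi, gi in zip(x, grad)]


x = [0.2, 0.5, 0.99]        # toy "pixels" in [0, 1]
grad = [0.3, -1.2, 4.0]     # hypothetical loss gradient with respect to x
x_adv = fgsm_step(x, grad, eps=0.03)
print(x_adv)
```

Only the sign of each gradient component matters, so every value moves by at most eps. The clip keeps the result inside the valid input range.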

[18]  arXiv:2205.09542 [pdf, other]
Title: Domain Enhanced Arbitrary Image Style Transfer via Contrastive Learning
Comments: Accepted by SIGGRAPH 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

In this work, we tackle the challenging problem of arbitrary image style transfer using a novel style feature representation learning method. A suitable style representation, as a key component in image stylization tasks, is essential to achieve satisfactory results. Existing deep neural network based approaches achieve reasonable results with the guidance from second-order statistics such as Gram matrix of content features. However, they do not leverage sufficient style information, which results in artifacts such as local distortions and style inconsistency. To address these issues, we propose to learn style representation directly from image features instead of their second-order statistics, by analyzing the similarities and differences between multiple styles and considering the style distribution. Specifically, we present Contrastive Arbitrary Style Transfer (CAST), which is a new style representation learning and style transfer method via contrastive learning. Our framework consists of three key components, i.e., a multi-layer style projector for style code encoding, a domain enhancement module for effective learning of style distribution, and a generative network for image style transfer. We conduct qualitative and quantitative evaluations comprehensively to demonstrate that our approach achieves significantly better results compared to those obtained via state-of-the-art methods. Code and models are available at https://github.com/zyxElsa/CAST_pytorch
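The second-order statistic that the abstract contrasts with learned style codes can be sketched as follows. This is a minimal illustration of a Gram matrix over flattened feature channels, not code from the CAST repository. Dividing by the number of spatial positions is one common normalization and is an assumption here.

```python
def gram_matrix(features):
    """Compute a C x C Gram matrix from C channels of flattened activations.

    features: list of C rows, each a list of H*W activation values.
    """
    n = len(features[0])
    # Entry (i, j) is the normalized inner product of channels i and j.
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
        for fi in features
    ]


# Two toy channels over four spatial positions; they never fire together.
feats = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
]
g = gram_matrix(feats)
print(g)
```

The matrix records which channels co-activate but discards where they do so, which is why it captures texture-like style while losing spatial layout.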

[19]  arXiv:2205.09576 [pdf, other]
Title: Discovering Dynamic Functional Brain Networks via Spatial and Channel-wise Attention
Comments: 12 pages,6 figures, submitted to 36th Conference on Neural Information Processing Systems (NeurIPS 2022)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Using deep learning models to recognize functional brain networks (FBNs) in functional magnetic resonance imaging (fMRI) has been attracting increasing interest recently. However, most existing work focuses on detecting static FBNs from entire fMRI signals, such as correlation-based functional connectivity. The sliding window is a widely used strategy to capture the dynamics of FBNs, but it is still limited in representing intrinsic functional interactive dynamics at each time step, and the number of FBNs usually needs to be set manually. Moreover, due to the complexity of dynamic interactions in the brain, traditional linear and shallow models are insufficient for identifying complex and spatially overlapped FBNs across each time step. In this paper, we propose a novel Spatial and Channel-wise Attention Autoencoder (SCAAE) for discovering FBNs dynamically. The core idea of SCAAE is to apply the attention mechanism to FBN construction. Specifically, we designed two attention modules: 1) a spatial-wise attention (SA) module to discover FBNs in the spatial domain and 2) a channel-wise attention (CA) module to weigh the channels for selecting the FBNs automatically. We evaluated our approach on the ADHD200 dataset, and our results indicate that the proposed SCAAE method can effectively recover the dynamic changes of the FBNs at each fMRI time step, without using sliding windows. More importantly, our proposed hybrid attention modules (SA and CA) do not enforce the assumptions of linearity and independence made by previous methods, and thus provide a novel approach to better understanding dynamic functional brain networks.

[20]  arXiv:2205.09579 [pdf, other]
Title: TRT-ViT: TensorRT-oriented Vision Transformer
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We revisit the existing excellent Transformers from the perspective of practical application. Most of them are not even as efficient as the basic ResNets series and deviate from the realistic deployment scenario. This may be because the current criteria for measuring computational efficiency, such as FLOPs or parameter counts, are one-sided, sub-optimal, and hardware-insensitive. Thus, this paper directly treats the TensorRT latency on the specific hardware as an efficiency metric, which provides more comprehensive feedback involving computational capacity, memory cost, and bandwidth. Based on a series of controlled experiments, this work derives four practical guidelines for TensorRT-oriented and deployment-friendly network design, e.g., early CNN and late Transformer at stage-level, early Transformer and late CNN at block-level. Accordingly, a family of TensorRT-oriented Transformers is presented, abbreviated as TRT-ViT. Extensive experiments demonstrate that TRT-ViT significantly outperforms existing ConvNets and vision Transformers with respect to the latency/accuracy trade-off across diverse visual tasks, e.g., image classification, object detection and semantic segmentation. For example, at 82.7% ImageNet-1k top-1 accuracy, TRT-ViT is 2.7$\times$ faster than CSWin and 2.0$\times$ faster than Twins. On the MS-COCO object detection task, TRT-ViT achieves comparable performance with Twins, while the inference speed is increased by 2.8$\times$.

[21]  arXiv:2205.09586 [pdf, other]
Title: On Trace of PGD-Like Adversarial Attacks
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Adversarial attacks pose safety and security concerns for deep learning applications. Although largely imperceptible, a strong PGD-like attack may leave a strong trace in the adversarial example. Since the attack triggers the local linearity of a network, we speculate that the network behaves with different extents of linearity for benign examples and adversarial examples. Thus, we construct Adversarial Response Characteristics (ARC) features to reflect the model's gradient consistency around the input to indicate the extent of linearity. Under certain conditions, it shows a gradually varying pattern from benign example to adversarial example, as the latter leads to the Sequel Attack Effect (SAE). The ARC feature can be used for informed attack detection (perturbation magnitude is known) with a binary classifier, or uninformed attack detection (perturbation magnitude is unknown) with ordinal regression. Due to the uniqueness of SAE to PGD-like attacks, ARC is also capable of inferring other attack details such as the loss function, or the ground-truth label as a post-processing defense. Qualitative and quantitative evaluations demonstrate the effectiveness of the ARC feature on CIFAR-10 w/ ResNet-18 and ImageNet w/ ResNet-152 and SwinT-B-IN1K with considerable generalization among PGD-like attacks despite domain shift. Our method is intuitive, lightweight, non-intrusive, and data-undemanding.
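As background for the attack family discussed above, a projected gradient descent (PGD) attack under an L-infinity budget can be sketched as follows. This is a generic illustration on a toy linear loss and does not implement the ARC features. The gradient function, step size, and budget are hypothetical, and clipping to a valid pixel range is omitted for brevity.

```python
def pgd_linf(x, grad_fn, eps, alpha, steps):
    """Iterated sign-gradient ascent, projected into the L-infinity ball of radius eps around x."""
    x_adv = list(x)
    for _ in range(steps):
        grad = grad_fn(x_adv)
        # Ascent step of size alpha along the sign of each gradient component.
        stepped = [xa + alpha * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        # Projection: keep every coordinate within eps of the original input.
        x_adv = [min(xi + eps, max(xi - eps, s)) for xi, s in zip(x, stepped)]
    return x_adv


# Toy loss L(x) = sum(w_i * x_i), whose gradient is the constant vector w.
w = [1.0, -2.0, 0.0]
x = [0.5, 0.5, 0.5]
x_adv = pgd_linf(x, lambda _: w, eps=0.1, alpha=0.04, steps=5)
print(x_adv)
```

On this toy loss, the coordinates with non-zero gradient reach the edge of the eps-ball after three steps and stay there. The coordinate with zero gradient is left unchanged.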

[22]  arXiv:2205.09592 [pdf, other]
Title: Transferable Physical Attack against Object Detection with Separable Attention
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Transferable adversarial attack is always in the spotlight since deep learning models have been demonstrated to be vulnerable to adversarial samples. However, existing physical attack methods do not pay enough attention to transferability to unseen models, thus leading to the poor performance of black-box attacks. In this paper, we put forward a novel method of generating physically realizable adversarial camouflage to achieve transferable attack against detection models. More specifically, we first introduce multi-scale attention maps based on detection models to capture features of objects with various resolutions. Meanwhile, we adopt a sequence of composite transformations to obtain the averaged attention maps, which could curb model-specific noise in the attention and thus further boost transferability. Unlike the general visualization interpretation methods where model attention should be put on the foreground object as much as possible, we carry out the attack on separable attention from the opposite perspective, i.e., suppressing attention of the foreground and enhancing that of the background. Consequently, transferable adversarial camouflage could be yielded efficiently with our novel attention-based loss function. Extensive comparison experiments verify the superiority of our method to state-of-the-art methods.

[23]  arXiv:2205.09594 [pdf, other]
Title: A Comparative Study of Feature Expansion Unit for 3D Point Cloud Upsampling
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recently, deep learning methods have shown great success in 3D point cloud upsampling. Among these methods, many feature expansion units have been proposed to complete point expansion at the end. In this paper, we compare various feature expansion units through both theoretical analysis and quantitative experiments. We show that most of the existing feature expansion units process each point feature independently, while ignoring the feature interaction among different points. Further, inspired by the upsampling module of image super-resolution and the recent success of dynamic graph CNNs on point clouds, we propose a novel feature expansion unit named ProEdgeShuffle. Experiments show that our proposed method can achieve considerable improvement over previous feature expansion units.

[24]  arXiv:2205.09601 [pdf, other]
Title: CORPS: Cost-free Rigorous Pseudo-labeling based on Similarity-ranking for Brain MRI Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Segmentation of brain magnetic resonance images (MRI) is crucial for the analysis of the human brain and diagnosis of various brain disorders. Manual delineation is time-consuming and error-prone; atlas-based and supervised machine learning methods aim to alleviate these drawbacks, but the former are computationally intensive and the latter lack a sufficiently large amount of labeled data. With this motivation, we propose CORPS, a semi-supervised segmentation framework built upon a novel atlas-based pseudo-labeling method and a 3D deep convolutional neural network (DCNN) for 3D brain MRI segmentation. In this work, we propose to generate expert-level pseudo-labels for an unlabeled set of images, in an order based on a local intensity-based similarity score to the existing labeled set of images, using a novel atlas-based label fusion method. Then, we propose to train a 3D DCNN on the combination of expert-labeled and pseudo-labeled images for binary segmentation of each anatomical structure. The binary segmentation approach is proposed to avoid the poor performance of multi-class segmentation methods on limited and imbalanced data. It also allows us to employ a 3D DCNN that is lightweight and efficient in terms of the number of filters, and to reserve memory resources for training the binary networks on full-scale and full-resolution 3D MRI volumes instead of 2D/3D patches or 2D slices. Thus, the proposed framework can encapsulate the spatial contiguity in each dimension and enhance context-awareness. The experimental results demonstrate the superiority of the proposed framework over the baseline method both qualitatively and quantitatively, without additional manual labeling cost.

[25]  arXiv:2205.09613 [pdf, other]
Title: Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection
Comments: 12 pages,5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Modern object detectors have taken advantage of pre-trained vision transformers by using them as backbone networks. However, except for the backbone networks, other detector components, such as the detector head and the feature pyramid network, remain randomly initialized, which hinders the consistency between detectors and pre-trained models. In this study, we propose to integrally migrate the pre-trained transformer encoder-decoders (imTED) for object detection, constructing a feature extraction-operation path that is not only "fully pre-trained" but also consistent with pre-trained models. The essential improvements of imTED over existing transformer-based detectors are twofold: (1) it embeds the pre-trained transformer decoder to the detector head; and (2) it removes the feature pyramid network from the feature extraction path. Such improvements significantly reduce the proportion of randomly initialized parameters and enhance the generation capability of detectors. Experiments on the MS COCO dataset demonstrate that imTED consistently outperforms its counterparts by ~2.8% AP. Without bells and whistles, imTED improves the state of the art in few-shot object detection by up to 7.6% AP, demonstrating significantly higher generalization capability. Code will be made publicly available.

[26]  arXiv:2205.09616 [pdf, other]
Title: Masked Image Modeling with Denoising Contrast
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Since the development of self-supervised visual representation learning from contrastive learning to masked image modeling, there is no significant difference in essence, that is, how to design proper pretext tasks for vision dictionary look-up. Masked image modeling recently dominates this line of research with state-of-the-art performance on vision Transformers, where the core is to enhance the patch-level visual context capturing of the network via denoising auto-encoding mechanism. Rather than tailoring image tokenizers with extra training stages as in previous works, we unleash the great potential of contrastive learning on denoising auto-encoding and introduce a new pre-training method, ConMIM, to produce simple intra-image inter-patch contrastive constraints as the learning objectives for masked patch prediction. We further strengthen the denoising mechanism with asymmetric designs, including image perturbations and model progress rates, to improve the network pre-training. ConMIM-pretrained vision Transformers with various scales achieve promising results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks.

[27]  arXiv:2205.09617 [pdf, other]
Title: A Topological Approach for Semi-Supervised Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Nowadays, Machine Learning and Deep Learning methods have become the state-of-the-art approach to solve data classification tasks. In order to use those methods, it is necessary to acquire and label a considerable amount of data; however, this is not straightforward in some fields, since data annotation is time-consuming and might require expert knowledge. This challenge can be tackled by means of semi-supervised learning methods that take advantage of both labelled and unlabelled data. In this work, we present new semi-supervised learning methods based on techniques from Topological Data Analysis (TDA), a field that is gaining importance for analysing large amounts of data with high variety and dimensionality. In particular, we have created two semi-supervised learning methods following two different topological approaches. In the former, we have used a homological approach that consists in studying the persistence diagrams associated with the data using the Bottleneck and Wasserstein distances. In the latter, we have taken into account the connectivity of the data. In addition, we have carried out a thorough analysis of the developed methods using 3 synthetic datasets, 5 structured datasets, and 2 datasets of images. The results show that the semi-supervised methods developed in this work outperform both models trained with only manually labelled data and classical semi-supervised learning methods, reaching improvements of up to 16%.

[28]  arXiv:2205.09671 [pdf, other]
Title: A graph-transformer for whole slide image classification
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep learning is a powerful tool for whole slide image (WSI) analysis. Typically, when performing supervised deep learning, a WSI is divided into small patches, a model is trained on them, and the outcomes are aggregated to estimate disease grade. However, patch-based methods introduce label noise during training by assuming that each patch is independent with the same label as the WSI, and they neglect overall WSI-level information that is significant in disease grading. Here we present a Graph-Transformer (GT) that fuses a graph-based representation of a WSI and a vision transformer for processing pathology images, called GTP, to predict disease grade. We selected $4,818$ WSIs from the Clinical Proteomic Tumor Analysis Consortium (CPTAC), the National Lung Screening Trial (NLST), and The Cancer Genome Atlas (TCGA), and used GTP to distinguish adenocarcinoma (LUAD) and squamous cell carcinoma (LSCC) from adjacent non-cancerous tissue (normal). First, using NLST data, we developed a contrastive learning framework to generate a feature extractor. This allowed us to compute feature vectors of individual WSI patches, which were used to represent the nodes of the graph, followed by construction of the GTP framework. Our model trained on the CPTAC data achieved consistently high performance on three-label classification (normal versus LUAD versus LSCC: mean accuracy $= 91.2 \pm 2.5\%$) based on five-fold cross-validation, and mean accuracy $= 82.3 \pm 1.0\%$ on external test data (TCGA). We also introduced a graph-based saliency mapping technique, called GraphCAM, that can identify regions that are highly associated with the class label. Our findings demonstrate GTP as an interpretable and effective deep learning framework for WSI-level classification.

[29]  arXiv:2205.09676 [pdf, other]
Title: Beyond Greedy Search: Tracking by Multi-Agent Reinforcement Learning-based Beam Search
Comments: In Peer Review
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Existing trackers usually select the location or proposal with the maximum score as the tracking result for each frame. However, such a greedy search scheme may not be the optimal choice, especially in challenging tracking scenarios such as heavy occlusion and fast motion, since accumulated errors make the response scores unreliable. In this paper, we propose a novel multi-agent reinforcement learning based beam search strategy (termed BeamTracking) to address this issue. Specifically, we formulate tracking as a sample selection problem fulfilled by multiple parallel decision-making processes, each of which aims at picking out one sample as its tracking result in each frame. We take the target feature, the proposal feature, and its response score as the state, and also consider the actions predicted by nearby agents, to train multiple agents to select their actions. When all the frames are processed, we select the trajectory with the maximum accumulated score as the tracking result. Extensive experiments on seven popular tracking benchmark datasets validate the effectiveness of the proposed algorithm.
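The contrast between greedy selection and keeping several candidate trajectories can be sketched with a plain beam search. This is only an illustration under stated assumptions, not the BeamTracking method: there is no reinforcement learning here, and the function name `beam_search_track`, the 1D candidate positions, and the `jump_penalty` term that couples consecutive frames are all hypothetical.

```python
def beam_search_track(frames, beam_width=2, jump_penalty=0.1):
    """Return the (path, accumulated score) of the best trajectory.

    `frames` is a list with one entry per frame; each entry is a list of
    (position, score) candidates. A hypothetical penalty proportional to
    the jump between consecutive positions links the frames together.
    """
    beams = [([], 0.0)]  # each beam is (positions so far, accumulated score)
    for candidates in frames:
        expanded = []
        for path, total in beams:
            for position, score in candidates:
                jump = abs(position - path[-1]) if path else 0.0
                expanded.append(
                    (path + [position], total + score - jump_penalty * jump)
                )
        # Keep only the best `beam_width` partial trajectories.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0]


frames = [
    [(0, 0.9), (10, 0.8)],
    [(1, 0.2), (10, 0.9)],
    [(2, 0.2), (11, 0.9)],
]
greedy_path, greedy_score = beam_search_track(frames, beam_width=1)
beam_path, beam_score = beam_search_track(frames, beam_width=2)
```

With a beam width of 1 the search commits to the highest-scoring candidate in the first frame and cannot recover, while a width of 2 keeps the runner-up alive long enough for it to win on accumulated score.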

[30]  arXiv:2205.09678 [pdf, ps, other]
Title: Semi-Supervised Learning for Image Classification using Compact Networks in the BioMedical Context
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The development of mobile and edge applications that embed deep convolutional neural models has the potential to revolutionise biomedicine. However, most deep learning models require computational resources that are not available in smartphones or edge devices, an issue that can be addressed by means of compact models. The problem with such models is that they are usually less accurate than bigger models. In this work, we study how this limitation can be addressed with the application of semi-supervised learning techniques. We conduct several statistical analyses to compare the performance of deep compact architectures when trained using semi-supervised learning methods for tackling image classification tasks in the biomedical context. In particular, we explore three families of compact networks and two families of semi-supervised learning techniques for 10 biomedical tasks. By combining semi-supervised learning methods with compact networks, it is possible to obtain a similar performance to standard size networks. In general, the best results are obtained when combining data distillation with MixNet, and plain distillation with ResNet-18. Also, in general, NAS networks obtain better results than manually designed networks and quantized networks. The work presented in this paper shows the benefits of applying semi-supervised methods to compact networks; this allows us to create compact models that are not only as accurate as standard size models, but also faster and lighter. Finally, we have developed a library that simplifies the construction of compact models using semi-supervised learning methods.

[31]  arXiv:2205.09690 [pdf, other]
Title: VNT-Net: Rotational Invariant Vector Neuron Transformers
Comments: arXiv admin note: text overlap with arXiv:2104.12229 by other authors
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Learning 3D point sets with rotational invariance is an important and challenging problem in machine learning. Through rotational invariant architectures, 3D point cloud neural networks are relieved from requiring a canonical global pose and from exhaustive data augmentation with all possible rotations. In this work, we introduce a rotational invariant neural network by combining recently introduced vector neurons with self-attention layers to build a point cloud vector neuron transformer network (VNT-Net). Vector neurons are known for their simplicity and versatility in representing SO(3) actions and are thereby incorporated in common neural operations. Similarly, Transformer architectures have gained popularity and recently were shown successful for images by applying directly on sequences of image patches and achieving superior performance and convergence. In order to benefit from both worlds, we combine the two structures by mainly showing how to adapt the multi-headed attention layers to comply with vector neurons operations. Through this adaptation attention layers become SO(3) and the overall network becomes rotational invariant. Experiments demonstrate that our network efficiently handles 3D point cloud objects in arbitrary poses. We also show that our network achieves higher accuracy when compared to related state-of-the-art methods and requires less training due to a smaller number of hyperparameters in common classification and segmentation tasks.

[32]  arXiv:2205.09722 [pdf, other]
Title: Light In The Black: An Evaluation of Data Augmentation Techniques for COVID-19 CT's Semantic Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

With the COVID-19 global pandemic, computer-assisted diagnoses of medical images have gained much attention, and robust methods of Semantic Segmentation of Computed Tomography (CT) became highly desirable. Semantic Segmentation of CT is one of many research fields of automatic detection of COVID-19 and has been widely explored since the COVID-19 outbreak. In this work, we propose an extensive analysis of how different data augmentation techniques improve the training of encoder-decoder neural networks on this problem. Twenty different data augmentation techniques were evaluated on five different datasets. Each dataset was validated through a five-fold cross-validation strategy, thus resulting in over 3,000 experiments. Our findings show that spatial level transformations are the most promising to improve the learning of neural networks on this problem.

[33]  arXiv:2205.09723 [pdf, other]
Title: Robust and Efficient Medical Imaging with Self-Supervision
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Recent progress in Medical Artificial Intelligence (AI) has delivered systems that can reach clinical expert level performance. However, such systems tend to demonstrate sub-optimal "out-of-distribution" performance when evaluated in clinical settings different from the training environment. A common mitigation strategy is to develop separate systems for each clinical setting using site-specific data [1]. However, this quickly becomes impractical as medical data is time-consuming to acquire and expensive to annotate [2]. Thus, the problem of "data-efficient generalization" presents an ongoing difficulty for Medical AI development. Although progress in representation learning shows promise, its benefits have not been rigorously studied, specifically for out-of-distribution settings. To meet these challenges, we present REMEDIS, a unified representation learning strategy to improve robustness and data-efficiency of medical imaging AI. REMEDIS uses a generic combination of large-scale supervised transfer learning with self-supervised learning and requires little task-specific customization. We study a diverse range of medical imaging tasks and simulate three realistic application scenarios using retrospective data. REMEDIS exhibits significantly improved in-distribution performance with up to 11.5% relative improvement in diagnostic accuracy over a strong supervised baseline. More importantly, our strategy leads to strong data-efficient generalization of medical imaging AI, matching strong supervised baselines using between 1% and 33% of retraining data across tasks. These results suggest that REMEDIS can significantly accelerate the life-cycle of medical imaging AI development, thereby presenting an important step forward for medical imaging AI to deliver broad impact.

[34]  arXiv:2205.09731 [pdf, other]
Title: Towards Unified Keyframe Propagation Models
Comments: CVPRW 2022 - AI for Content Creation Workshop. Code at https://github.com/runwayml/guided-inpainting
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Many video editing tasks such as rotoscoping or object removal require the propagation of context across frames. While transformers and other attention-based approaches that aggregate features globally have demonstrated great success at propagating object masks from keyframes to the whole video, they struggle to propagate high-frequency details such as textures faithfully. We hypothesize that this is due to an inherent bias of global attention towards low-frequency features. To overcome this limitation, we present a two-stream approach, where high-frequency features interact locally and low-frequency features interact globally. The global interaction stream remains robust in difficult situations such as large camera motions, where explicit alignment fails. The local interaction stream propagates high-frequency details through deformable feature aggregation and, informed by the global interaction stream, learns to detect and correct errors of the deformation field. We evaluate our two-stream approach for inpainting tasks, where experiments show that it improves both the propagation of features within a single frame as required for image inpainting, as well as their propagation from keyframes to target frames. Applied to video inpainting, our approach leads to 44% and 26% improvements in FID and LPIPS scores. Code at https://github.com/runwayml/guided-inpainting

[35]  arXiv:2205.09739 [pdf, other]
Title: Diverse Weight Averaging for Out-of-Distribution Generalization
Comments: 31 pages, 14 figures, 11 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Standard neural networks struggle to generalize under distribution shifts. For out-of-distribution generalization in computer vision, the best current approach averages the weights along a training run. In this paper, we propose Diverse Weight Averaging (DiWA) that makes a simple change to this strategy: DiWA averages the weights obtained from several independent training runs rather than from a single run. Perhaps surprisingly, averaging these weights performs well under soft constraints despite the network's nonlinearities. The main motivation behind DiWA is to increase the functional diversity across averaged models. Indeed, models obtained from different runs are more diverse than those collected along a single run thanks to differences in hyperparameters and training procedures. We motivate the need for diversity by a new bias-variance-covariance-locality decomposition of the expected error, exploiting similarities between DiWA and standard functional ensembling. Moreover, this decomposition highlights that DiWA succeeds when the variance term dominates, which we show happens when the marginal distribution changes at test time. Experimentally, DiWA consistently improves the state of the art on the competitive DomainBed benchmark without inference overhead.
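The averaging step described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the dict-of-lists weight layout and the name `average_weights` are assumptions made for the example, and all runs are assumed to share one architecture.

```python
def average_weights(runs):
    """Uniformly average parameters across independently trained runs.

    `runs` is a list of dicts mapping a parameter name to a flat list of
    floats (a hypothetical layout chosen for this sketch).
    """
    if not runs:
        raise ValueError("need at least one run")
    averaged = {}
    for name in runs[0]:
        # zip(*...) groups the same coordinate from every run together.
        columns = zip(*(run[name] for run in runs))
        averaged[name] = [sum(column) / len(runs) for column in columns]
    return averaged


runs = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [5.0, 6.0], "b": [2.0]},
]
avg = average_weights(runs)
```

Because the result is a single set of weights, inference costs the same as for one model, which is the "without inference overhead" property the abstract mentions.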

[36]  arXiv:2205.09743 [pdf, other]
Title: BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving
Comments: Code: https://github.com/zhangyp15/BEVerse
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems. Unlike existing studies focusing on the improvement of single-task approaches, BEVerse produces spatio-temporal Birds-Eye-View (BEV) representations from multi-camera videos and jointly reasons about multiple tasks for vision-centric autonomous driving. Specifically, BEVerse first performs shared feature extraction and lifting to generate 4D BEV representations from multi-timestamp and multi-view images. After the ego-motion alignment, the spatio-temporal encoder is utilized for further feature extraction in BEV. Finally, multiple task decoders are attached for joint reasoning and prediction. Within the decoders, we propose the grid sampler to generate BEV features with different ranges and granularities for different tasks. Also, we design the method of iterative flow for memory-efficient future prediction. We show that the temporal information improves 3D object detection and semantic map construction, while the multi-task learning can implicitly benefit motion prediction. With extensive experiments on the nuScenes dataset, we show that the multi-task BEVerse outperforms existing single-task methods on 3D object detection, semantic map construction, and motion prediction. Compared with the sequential paradigm, BEVerse is also significantly more efficient. The code and trained models will be released at https://github.com/zhangyp15/BEVerse.

Cross-lists for Fri, 20 May 22

[37]  arXiv:2205.09114 (cross-list from cond-mat.quant-gas) [pdf, other]
Title: Dark Solitons in Bose-Einstein Condensates: A Dataset for Many-body Physics Research
Comments: 16 pages, 4 figures
Subjects: Quantum Gases (cond-mat.quant-gas); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

We establish a dataset of over $1.6\times10^4$ experimental images of Bose-Einstein condensates containing solitonic excitations to enable machine learning (ML) for many-body physics research. About 33% of this dataset has manually assigned and carefully curated labels. The remainder is automatically labeled using SolDet -- an implementation of a physics-informed ML data analysis framework -- consisting of a convolutional-neural-network-based classifier and object detector as well as a statistically motivated physics-informed classifier and a quality metric. This technical note constitutes the definitive reference of the dataset, providing an opportunity for the data science community to develop more sophisticated analysis tools, to further understand nonlinear many-body physics, and even advance cold atom experiments.

[38]  arXiv:2205.09116 (cross-list from eess.IV) [pdf, other]
Title: Exploring the Adjugate Matrix Approach to Quaternion Pose Extraction
Comments: 67 pages, 5 appendices, 9 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Quaternions are important for a wide variety of rotation-related problems in computer graphics, machine vision, and robotics. We study the nontrivial geometry of the relationship between quaternions and rotation matrices by exploiting the adjugate matrix of the characteristic equation of a related eigenvalue problem to obtain the manifold of the space of a quaternion eigenvector. We argue that quaternions parameterized by their corresponding rotation matrices cannot be expressed, for example, in machine learning tasks, as single-valued functions: the quaternion solution must instead be treated as a manifold, with different algebraic solutions for each of several single-valued sectors represented by the adjugate matrix. We conclude with novel constructions exploiting the quaternion adjugate variables to revisit several classic pose estimation applications: 2D point-cloud matching, 2D point-cloud-to-projection matching, 3D point-cloud matching, 3D orthographic point-cloud-to-projection matching, and 3D perspective point-cloud-to-projection matching. We find an exact solution to the 3D orthographic least squares pose extraction problem, and apply it successfully also to the perspective pose extraction problem with results that improve on existing methods.
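The simplest piece of the multi-valuedness discussed here is the well-known sign ambiguity: a unit quaternion q and its negation -q produce the same rotation matrix, so no single-valued function can map a rotation matrix back to "its" quaternion. The sketch below only demonstrates that standard fact; it does not implement the adjugate construction, and the function name `quat_to_matrix` and the (w, x, y, z) component order are choices made for this example.

```python
import math


def quat_to_matrix(q):
    """Rotation matrix of a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


# A 90-degree rotation about the z axis, and the negated quaternion.
half = math.pi / 4
q = (math.cos(half), 0.0, 0.0, math.sin(half))
neg_q = tuple(-c for c in q)

rotation = quat_to_matrix(q)
rotation_from_neg = quat_to_matrix(neg_q)
```

Every entry of the matrix is quadratic in the quaternion components, which is why flipping the sign of all four components leaves the matrix unchanged.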

[39]  arXiv:2205.09180 (cross-list from cs.LG) [pdf, other]
Title: LeRaC: Learning Rate Curriculum
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Most curriculum learning methods require an approach to sort the data samples by difficulty, which is often cumbersome to perform. In this work, we propose a novel curriculum learning approach termed Learning Rate Curriculum (LeRaC), which leverages the use of a different learning rate for each layer of a neural network to create a data-free curriculum during the initial training epochs. More specifically, LeRaC assigns higher learning rates to neural layers closer to the input, gradually decreasing the learning rates as the layers are placed farther away from the input. The learning rates increase at various paces during the first training iterations, until they all reach the same value. From this point on, the neural model is trained as usual. This creates a model-level curriculum learning strategy that does not require sorting the examples by difficulty and is compatible with any neural network, generating higher performance levels regardless of the architecture. We conduct comprehensive experiments on eight datasets from the computer vision (CIFAR-10, CIFAR-100, Tiny ImageNet), language (BoolQ, QNLI, RTE) and audio (ESC-50, CREMA-D) domains, considering various convolutional (ResNet-18, Wide-ResNet-50, DenseNet-121), recurrent (LSTM) and transformer (CvT, BERT, SepTr) architectures, comparing our approach with the conventional training regime. Moreover, we also compare with Curriculum by Smoothing (CBS), a state-of-the-art data-free curriculum learning approach. Unlike CBS, our performance improvements over the standard training regime are consistent across all datasets and models. Furthermore, we significantly surpass CBS in terms of training time (there is no additional cost over the standard training regime for LeRaC).
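A per-layer schedule of the kind described in this abstract might look like the sketch below. The abstract does not give the exact formula, so everything specific here is an assumption: the name `layer_learning_rates`, the geometric `decay` used to set the initial rates, and the geometric interpolation toward the shared target rate.

```python
def layer_learning_rates(num_layers, target_lr, step, warmup_steps, decay=0.1):
    """Return one learning rate per layer, input layer first.

    Assumed schedule: layer `depth` starts at target_lr * decay**depth, so
    layers closer to the input start higher, and every layer is interpolated
    geometrically so that all rates equal target_lr after `warmup_steps`.
    """
    progress = min(step, warmup_steps) / warmup_steps
    rates = []
    for depth in range(num_layers):
        initial = target_lr * decay ** depth
        rates.append(initial * (target_lr / initial) ** progress)
    return rates


start = layer_learning_rates(3, 0.1, step=0, warmup_steps=10)
middle = layer_learning_rates(3, 0.1, step=5, warmup_steps=10)
end = layer_learning_rates(3, 0.1, step=10, warmup_steps=10)
```

In this sketch deeper layers start lower and therefore rise faster, which matches the abstract's statement that the rates increase at various paces until they coincide. After the warmup the function returns the same rate for every layer and training proceeds as usual.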

[40]  arXiv:2205.09182 (cross-list from cs.LG) [pdf, other]
Title: Computing the ensemble spread from deterministic weather predictions using conditional generative adversarial networks
Comments: 9 pages, 4 figures, 3 tables; release version
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph); Fluid Dynamics (physics.flu-dyn)

Ensemble prediction systems are an invaluable tool for weather forecasting. Practically, ensemble predictions are obtained by running several perturbations of the deterministic control forecast. However, ensemble prediction is associated with a high computational cost and often involves statistical post-processing steps to improve its quality. Here we propose to use deep-learning-based algorithms to learn the statistical properties of an ensemble prediction system, the ensemble spread, given only the deterministic control forecast. Thus, once trained, the costly ensemble prediction system will not be needed anymore to obtain future ensemble forecasts, and the statistical properties of the ensemble can be derived from a single deterministic forecast. We adapt the classical pix2pix architecture to a three-dimensional model and also experiment with a shared latent space encoder-decoder model, and train them against several years of operational (ensemble) weather forecasts for the 500 hPa geopotential height. The results demonstrate that the trained models indeed allow obtaining a highly accurate ensemble spread from the control forecast only.
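For readers unfamiliar with the target quantity, the ensemble spread is commonly taken to be the standard deviation across ensemble members at each grid point. The sketch below computes it that way; the flat-list grid layout, the name `ensemble_spread`, and the use of the population standard deviation (rather than the sample one) are assumptions of this example, not details from the paper.

```python
import statistics


def ensemble_spread(members):
    """Per-grid-point standard deviation across ensemble members.

    `members` is a list of forecasts, each a flat list of values on the
    same grid (for instance 500 hPa geopotential height).
    """
    # zip(*members) collects the values of all members at one grid point.
    return [statistics.pstdev(values) for values in zip(*members)]


members = [
    [5500.0, 5600.0],
    [5520.0, 5600.0],
    [5540.0, 5600.0],
]
spread = ensemble_spread(members)
```

A learned model in the spirit of the abstract would predict this field from the control forecast alone, without running the members.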

[41]  arXiv:2205.09228 (cross-list from cs.LG) [pdf, other]
Title: Scalable Multi-view Clustering with Graph Filtering
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI)

With the explosive growth of multi-source data, multi-view clustering has attracted great attention in recent years. Most existing multi-view methods operate in raw feature space and heavily depend on the quality of original feature representation. Moreover, they are often designed for feature data and ignore the rich topology structure information. Accordingly, in this paper, we propose a generic framework to cluster both attribute and graph data with heterogeneous features. It is capable of exploring the interplay between feature and structure. Specifically, we first adopt graph filtering technique to eliminate high-frequency noise to achieve a clustering-friendly smooth representation. To handle the scalability challenge, we develop a novel sampling strategy to improve the quality of anchors. Extensive experiments on attribute and graph benchmarks demonstrate the superiority of our approach with respect to state-of-the-art approaches.
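The graph filtering step can be illustrated with a common low-pass filter from the graph signal processing literature, (I - L/2)^k, where L is the symmetrically normalized Laplacian. The abstract does not state which filter is used, so this choice, the name `smooth_signal`, and the dense adjacency-list representation are assumptions of the sketch.

```python
def smooth_signal(adjacency, signal, order=2):
    """Apply the low-pass filter (I - L/2)^order to a node signal.

    `adjacency` is a dense symmetric 0/1 matrix with no isolated nodes and
    `signal` holds one value per node. With L = I - S, where
    S = D^{-1/2} A D^{-1/2}, one filter step equals (x + S x) / 2.
    """
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    s = [
        [adjacency[i][j] / (degree[i] * degree[j]) ** 0.5 for j in range(n)]
        for i in range(n)
    ]
    x = list(signal)
    for _ in range(order):
        x = [
            0.5 * (x[i] + sum(s[i][j] * x[j] for j in range(n)))
            for i in range(n)
        ]
    return x


triangle = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
once = smooth_signal(triangle, [1.0, 0.0, 0.0], order=1)
twice = smooth_signal(triangle, [1.0, 0.0, 0.0], order=2)
```

Each application pulls a node's value toward those of its neighbours, so the gap between connected nodes shrinks as the order grows; that smoothness is what makes the filtered representation easier to cluster.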

[42]  arXiv:2205.09248 (cross-list from cs.SD) [pdf, other]
Title: MESH2IR: Neural Acoustic Impulse Response Generator for Complex 3D Scenes
Comments: More results and source code is available at this https URL
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

We propose a mesh-based neural network (MESH2IR) to generate acoustic impulse responses (IRs) for indoor 3D scenes represented using a mesh. The IRs are used to create a high-quality sound experience in interactive applications and audio processing. Our method can handle input triangular meshes with arbitrary topologies (2K - 3M triangles). We present a novel training technique to train MESH2IR using energy decay relief and highlight its benefits. We also show that training MESH2IR on IRs preprocessed using our proposed technique significantly improves the accuracy of IR generation. We reduce the non-linearity in the mesh space by transforming 3D scene meshes to latent space using a graph convolution network. Our MESH2IR is more than 200 times faster than a geometric acoustic algorithm on a CPU and can generate more than 10,000 IRs per second on an NVIDIA GeForce RTX 2080 Ti GPU for a given furnished indoor 3D scene. The acoustic metrics are used to characterize the acoustic environment. We show that the acoustic metrics of the IRs predicted from our MESH2IR match the ground truth with less than 10% error. We also highlight the benefits of MESH2IR on audio and speech processing applications such as speech dereverberation and speech separation. To the best of our knowledge, ours is the first neural-network-based approach to predict IRs from a given 3D scene mesh in real-time.

[43]  arXiv:2205.09249 (cross-list from cs.CL) [pdf, other]
Title: On the Limits of Evaluating Embodied Agent Model Generalization Using Validation Sets
Comments: ACL 2022 Insights Workshop (6 pages)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Natural language guided embodied task completion is a challenging problem since it requires understanding natural language instructions, aligning them with egocentric visual observations, and choosing appropriate actions to execute in the environment to produce desired changes. We experiment with augmenting a transformer model for this task with modules that effectively utilize a wider field of view and learn to choose whether the next step requires a navigation or manipulation action. We observed that the proposed modules resulted in improved, and in fact state-of-the-art, performance on an unseen validation set of a popular benchmark dataset, ALFRED. However, our best model selected using the unseen validation set underperforms on the unseen test split of ALFRED, indicating that performance on the unseen validation set may not in itself be a sufficient indicator of whether model improvements generalize to unseen test sets. We highlight this result because we believe it may be a wider phenomenon in machine learning tasks that is primarily noticeable only in benchmarks that limit evaluations on test splits, and because it highlights the need to modify benchmark design to better account for variance in model performance.

[44]  arXiv:2205.09315 (cross-list from eess.IV) [pdf, other]
Title: A Sub-pixel Accurate Quantification of Joint Space Narrowing Progression in Rheumatoid Arthritis
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Rheumatoid arthritis (RA) is a chronic autoimmune disease that primarily affects peripheral synovial joints, such as those of the fingers, wrists and feet. Radiology plays a critical role in the diagnosis and monitoring of RA. Because of the limited spatial resolution of current radiographic imaging, joint space narrowing (JSN) progression in RA can be less than one pixel per year at commonly used spatial resolutions. Insensitive monitoring of JSN can hinder the radiologist/rheumatologist from making a proper and timely clinical judgment. In this paper, we propose a novel and sensitive method, which we call partial image phase-only correlation, that aims to automatically quantify JSN progression in the early stages of RA. The majority of the current literature utilizes the mean error, root-mean-square deviation and standard deviation to report the accuracy at pixel level. Our work measures JSN progression between a baseline and its follow-up finger joint images by using the phase spectrum in the frequency domain. With this method, the mean error can be reduced to 0.0130 mm when applied to phantom radiographs with ground truth, and the standard deviation to 0.0519 mm for clinical radiography. With its sub-pixel accuracy far beyond manual measurement, we are optimistic that our work is promising for automatically quantifying JSN progression.
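
The abstract above builds on phase-only correlation, which keeps only the phase of the cross spectrum of two signals so that the inverse transform peaks at their relative displacement. The following is a minimal 1-D sketch of the generic technique, assuming circular shifts and integer-pixel peak picking; it is not the authors' partial image variant, and the sub-pixel peak fitting they report is not shown. The function names dft, phase_only_correlation and estimate_shift are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse if requested)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def phase_only_correlation(f, g, eps=1e-12):
    """Correlation surface computed from the phase of the cross spectrum only."""
    F, G = dft(f), dft(g)
    # conj(F) * G carries the phase difference between the two signals.
    cross = [a.conjugate() * b for a, b in zip(F, G)]
    # Discard magnitude so that only phase information remains.
    normalized = [c / max(abs(c), eps) for c in cross]
    return [v.real for v in dft(normalized, inverse=True)]


def estimate_shift(f, g):
    """Integer circular shift d (0 <= d < n) such that g[t] == f[(t - d) % n]."""
    poc = phase_only_correlation(f, g)
    return max(range(len(poc)), key=poc.__getitem__)


baseline = [0.0, 1.0, 4.0, 2.0, 0.0, 0.0, 3.0, 1.0]
follow_up = [baseline[(t - 3) % len(baseline)] for t in range(len(baseline))]
shift = estimate_shift(baseline, follow_up)
```

In this toy case the follow-up signal is the baseline circularly shifted by three samples, so the correlation peak lands at index 3. Sub-pixel estimates are usually obtained by fitting a model to the samples around the peak, which this sketch leaves out.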

[45]  arXiv:2205.09327 (cross-list from cs.AI) [pdf, other]
Title: Let's Talk! Striking Up Conversations via Conversational Visual Question Generation
Comments: Accepted as a full talk paper on AAAI-DEEPDIAL'21
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

An engaging and provocative question can open up a great conversation. In this work, we explore a novel scenario: a conversation agent views a set of the user's photos (for example, from social media platforms) and asks an engaging question to initiate a conversation with the user. The existing vision-to-question models mostly generate tedious and obvious questions, which might not be ideal conversation starters. This paper introduces a two-phase framework that first generates a visual story for the photo set and then uses the story to produce an interesting question. The human evaluation shows that our framework generates more response-provoking questions for starting conversations than other vision-to-question baselines.

[46]  arXiv:2205.09382 (cross-list from eess.IV) [pdf, other]
Title: BabyNet: Residual Transformer Module for Birth Weight Prediction on Fetal Ultrasound Video
Comments: Early accepted for 25th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2022, Singapore
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Predicting fetal weight at birth is an important aspect of perinatal care, particularly in the context of antenatal management, which includes the planned timing and the mode of delivery. Accurate prediction of weight using prenatal ultrasound is challenging as it requires images of specific fetal body parts during advanced pregnancy, which are difficult to capture due to the poor quality of images caused by the lack of amniotic fluid. As a consequence, predictions which rely on standard methods often suffer from significant errors. In this paper we propose the Residual Transformer Module, which extends a 3D ResNet-based network for analysis of 2D+t spatio-temporal ultrasound video scans. Our end-to-end method, called BabyNet, automatically predicts fetal birth weight based on fetal ultrasound video scans. We evaluate BabyNet using a dedicated clinical set comprising 225 2D fetal ultrasound videos of pregnancies from 75 patients performed one day prior to delivery. Experimental results show that BabyNet outperforms several state-of-the-art methods and estimates the weight at birth with accuracy comparable to human experts. Furthermore, combining estimates provided by human experts with those computed by BabyNet yields the best results, outperforming either method alone by a significant margin. The source code of BabyNet is available at https://github.com/SanoScience/BabyNet.

[47]  arXiv:2205.09448 (cross-list from cs.AI) [pdf, other]
Title: Image Augmentation Based Momentum Memory Intrinsic Reward for Sparse Reward Visual Scenes
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Many real-life scenes can be abstracted as sparse reward visual scenes, in which it is difficult for an agent to tackle the task when it receives only images and sparse rewards. We propose to decompose this problem into two sub-problems: the visual representation and the sparse reward. To address them, we present a novel framework, IAMMIR, that combines self-supervised representation learning with intrinsic motivation. For visual representation, a representation driven by a combination of the image-augmented forward dynamics and the reward is acquired. For sparse rewards, a new type of intrinsic reward is designed, the Momentum Memory Intrinsic Reward (MMIR). It utilizes the difference between the outputs of the current model (online network) and the historical model (target network) to represent the agent's state familiarity. Our method is evaluated on the visual navigation task with sparse rewards in Vizdoom. Experiments demonstrate that our method achieves state-of-the-art performance in sample efficiency, reaching a 100% success rate at least 2 times faster than the existing methods.
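
The online/target mechanism described above can be illustrated with a toy sketch. It assumes linear networks, a squared-distance reward, and an exponential-moving-average (momentum) update of the target weights; these choices and the names linear, intrinsic_reward and momentum_update are illustrative assumptions, not details confirmed by the abstract.

```python
def linear(weights, state):
    """Apply a weight matrix (list of rows) to a state vector."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def intrinsic_reward(online_w, target_w, state):
    """Squared distance between online and target outputs (assumed reward form)."""
    online_out = linear(online_w, state)
    target_out = linear(target_w, state)
    return sum((a - b) ** 2 for a, b in zip(online_out, target_out))


def momentum_update(target_w, online_w, momentum=0.99):
    """Move the target weights slowly toward the online weights (assumed EMA rule)."""
    return [
        [momentum * t + (1.0 - momentum) * o for t, o in zip(t_row, o_row)]
        for t_row, o_row in zip(target_w, online_w)
    ]


online = [[1.0, 0.0], [0.0, 1.0]]
target = [[0.5, 0.0], [0.0, 0.5]]
state = [2.0, 2.0]

reward_before = intrinsic_reward(online, target, state)
target = momentum_update(target, online, momentum=0.9)
reward_after = intrinsic_reward(online, target, state)
```

Under these assumptions the reward shrinks as the target network catches up with the online network, which is one way a state can come to count as familiar.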

[48]  arXiv:2205.09533 (cross-list from physics.med-ph) [pdf, other]
Title: Estimating the ultrasound attenuation coefficient using convolutional neural networks -- a feasibility study
Comments: 4 figures
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)

The attenuation coefficient (AC) is a fundamental measure of tissue acoustical properties, which can be used in medical diagnostics. In this work, we investigate the feasibility of using convolutional neural networks (CNNs) to directly estimate AC from radio-frequency (RF) ultrasound signals. To develop the CNNs we used RF signals collected from tissue-mimicking numerical phantoms for AC values in a range from 0.1 to 1.5 dB/(MHz*cm). The models were trained based on 1-D patches of RF data. We obtained mean absolute AC estimation errors of 0.08, 0.12, 0.20 and 0.25 for patch lengths of 10 mm, 5 mm, 2 mm and 1 mm, respectively. We explain the performance of the model by visualizing the frequency content associated with convolutional filters. Our study shows that the AC can be calculated using deep learning, and that the weights of the CNNs can have a physical interpretation.

[49]  arXiv:2205.09612 (cross-list from cs.LG) [pdf, other]
Title: CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

In this paper, we propose a Classification Confidence Network (CLCNet) that can determine whether the classification model classifies input samples correctly. It can take a classification result in the form of a vector of any dimension, and return a confidence score as output, which represents the probability of an instance being classified correctly. We can utilize CLCNet in a simple cascade structure system consisting of several SOTA (state-of-the-art) classification models, and our experiments show that the system can achieve the following advantages: 1. The system can customize the average computation requirement (FLOPs) per image during inference. 2. Under the same computation requirement, the performance of the system can exceed any model that has the same structure as the models in the system but a different size. In fact, this is a new type of ensemble modeling. Like general ensemble modeling, it can achieve higher performance than a single classification model, yet our system requires much less computation than general ensemble modeling. We have uploaded our code to a GitHub repository: https://github.com/yaoching0/CLCNet-Rethinking-of-Ensemble-Modeling.
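
A confidence-gated cascade of the kind described above can be sketched as follows. The sketch assumes that models are tried from cheapest to most expensive and that the first prediction whose confidence reaches a threshold is accepted; the maximum class probability is used as a stand-in for the learned CLCNet score, and cascade_predict, the toy models and the cost values are hypothetical.

```python
def cascade_predict(x, models, confidence_fn, threshold):
    """Return (label, total_cost), stopping at the first confident model.

    models is a list of (predict_fn, cost) pairs ordered from cheap to expensive.
    """
    if not models:
        raise ValueError("at least one model is required")
    total_cost = 0.0
    for index, (predict_fn, cost) in enumerate(models):
        probs = predict_fn(x)
        total_cost += cost
        is_last = index == len(models) - 1
        # Accept the prediction if it is confident enough or no model is left.
        if is_last or confidence_fn(probs) >= threshold:
            label = max(range(len(probs)), key=probs.__getitem__)
            return label, total_cost


def small_model(x):
    return [0.55, 0.45]  # cheap but uncertain toy prediction


def large_model(x):
    return [0.10, 0.90]  # expensive but confident toy prediction


models = [(small_model, 1.0), (large_model, 10.0)]
strict = cascade_predict(None, models, confidence_fn=max, threshold=0.8)
lenient = cascade_predict(None, models, confidence_fn=max, threshold=0.5)
```

Raising the threshold sends more inputs to the larger model, which is how such a cascade can trade average computation per image against accuracy.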

[50]  arXiv:2205.09615 (cross-list from cs.LG) [pdf, other]
Title: EXACT: How to Train Your Accuracy
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Classification tasks are usually evaluated in terms of accuracy. However, accuracy is discontinuous and cannot be directly optimized using gradient ascent. Popular methods minimize cross-entropy, Hinge loss, or other surrogate losses, which can lead to suboptimal results. In this paper, we propose a new optimization framework by introducing stochasticity to a model's output and optimizing expected accuracy, i.e. accuracy of the stochastic model. Extensive experiments on image classification show that the proposed optimization method is a powerful alternative to widely used classification losses.
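
The quantity being optimized above, the accuracy of a stochastic model, can be estimated by sampling. The sketch below assumes Gaussian noise added to the logits and a plain Monte Carlo estimate; the noise model, the estimator and the name expected_accuracy are illustrative assumptions, and the differentiable formulation needed for training is not shown.

```python
import random


def expected_accuracy(logits, labels, sigma=1.0, n_samples=1000, seed=0):
    """Monte Carlo estimate of accuracy when Gaussian noise perturbs the logits."""
    rng = random.Random(seed)
    hits = 0
    for row, label in zip(logits, labels):
        for _ in range(n_samples):
            noisy = [value + rng.gauss(0.0, sigma) for value in row]
            predicted = max(range(len(noisy)), key=noisy.__getitem__)
            hits += predicted == label
    return hits / (len(labels) * n_samples)


confident = expected_accuracy([[5.0, 0.0], [0.0, 5.0]], [0, 1], sigma=0.1)
undecided = expected_accuracy([[0.0, 0.0]], [0], sigma=1.0, n_samples=2000)
```

Unlike plain accuracy, this expectation varies smoothly with the logits: well-separated logits give a value near 1, while tied logits give a value near 0.5 for two classes.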

[51]  arXiv:2205.09624 (cross-list from cs.LG) [pdf, other]
Title: Focused Adversarial Attacks
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

Recent advances in machine learning show that neural models are vulnerable to minimally perturbed inputs, or adversarial examples. Adversarial algorithms are optimization problems that minimize the accuracy of ML models by perturbing inputs, often using a model's loss function to craft such perturbations. State-of-the-art object detection models are characterized by very large output manifolds due to the number of possible locations and sizes of objects in an image. As a result, their outputs are sparse, and optimization problems that use them incur a lot of unnecessary computation.
We propose to use a very limited subset of a model's learned manifold to compute adversarial examples. Our Focused Adversarial Attacks (FA) algorithm identifies a small subset of sensitive regions to perform gradient-based adversarial attacks. FA is significantly faster than other gradient-based attacks when a model's manifold is sparsely activated. Also, its perturbations are more efficient than other methods under the same perturbation constraints. We evaluate FA on the COCO 2017 and Pascal VOC 2007 detection datasets.

[52]  arXiv:2205.09706 (cross-list from eess.IV) [pdf, other]
Title: k-strip: A novel segmentation algorithm in k-space for the application of skull stripping
Comments: 11 pages, 6 figures, 2 tables
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Objectives: Present a novel deep learning-based skull stripping algorithm for magnetic resonance imaging (MRI) that works directly in the information rich k-space.
Materials and Methods: Using two datasets from different institutions with a total of 36,900 MRI slices, we trained a deep learning-based model to work directly with the complex raw k-space data. Skull stripping performed by HD-BET (Brain Extraction Tool) in the image domain was used as the ground truth.
Results: Results on both datasets were very similar to the ground truth (DICE scores of 92%-98% and Hausdorff distances of under 5.5 mm). Results on slices above the eye region reach DICE scores of up to 99%, while the accuracy drops in regions around the eyes and below, with partially blurred output. The output of k-strip often has smoothed edges at the demarcation to the skull. Binary masks are created with an appropriate threshold.
Conclusion: With this proof-of-concept study, we were able to show the feasibility of working in the k-space frequency domain, preserving phase information, with consistent results. Future research should be dedicated to discovering additional ways the k-space can be used for innovative image analysis and further workflows.
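
The DICE score reported above, and the thresholding step that turns a soft output into a binary mask, can be sketched as follows. The threshold value of 0.5 and the names binarize and dice_score are assumptions for illustration; the abstract does not state which threshold was used.

```python
def binarize(prob_map, threshold=0.5):
    """Turn a 2-D map of soft values into a binary mask (threshold is assumed)."""
    return [[1 if value >= threshold else 0 for value in row] for row in prob_map]


def dice_score(pred_mask, truth_mask):
    """Dice coefficient 2|A and B| / (|A| + |B|) for two binary masks."""
    pred = [bool(v) for row in pred_mask for v in row]
    truth = [bool(v) for row in truth_mask for v in row]
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


soft_output = [[0.9, 0.2], [0.7, 0.1]]
ground_truth = [[1, 0], [0, 0]]
score = dice_score(binarize(soft_output), ground_truth)
```

Here the predicted mask has two foreground pixels, the ground truth has one, and they overlap in one, giving a score of 2/3.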

[53]  arXiv:2205.09709 (cross-list from eess.AS) [pdf, other]
Title: Bi-LSTM Scoring Based Similarity Measurement with Agglomerative Hierarchical Clustering (AHC) for Speaker Diarization
Comments: 8 pages, 3 figures, 2 tables, 1 algorithm, Technical Report: Recognition Technologies, Inc
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

The majority of speech signals across different scenarios are never available with well-defined audio segments containing only a single speaker. A typical conversation between two speakers consists of segments where their voices overlap, interrupt each other or halt between sentences. Recent advancements in diarization technology leverage neural network-based approaches to improve multiple subsystems of a speaker diarization system, including extracting segment-wise embedding features and detecting speaker changes during a conversation. However, to identify speakers through clustering, models depend on methodologies like PLDA to generate a similarity measure between two extracted segments from a given conversational audio. Since these algorithms ignore the temporal structure of conversations, they tend to yield a higher Diarization Error Rate (DER), leading to misdetections in both speaker and change identification. Therefore, to compare the similarity of two speech segments both independently and sequentially, we propose a Bi-directional Long Short-term Memory network for estimating the elements of the similarity matrix. Once the similarity matrix is generated, Agglomerative Hierarchical Clustering (AHC) is applied to further identify speaker segments based on thresholding. To evaluate the performance, the Diarization Error Rate (DER%) metric is used. The proposed model achieves a DER of 34.80% on a test set of audio samples derived from the ICSI Meeting Corpus, compared to a DER of 39.90% for the traditional PLDA-based similarity measurement mechanism.
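
The clustering step described above, AHC with a stopping threshold applied to a precomputed similarity matrix, can be sketched as follows. The sketch assumes average linkage and takes the matrix as given; in the paper the entries come from the Bi-LSTM scorer, which is not reproduced here. The name ahc_threshold and the toy matrix are hypothetical.

```python
def ahc_threshold(similarity, threshold):
    """Merge the most similar clusters until no pair reaches the threshold.

    similarity is a symmetric matrix of pairwise segment scores.
    Returns one cluster label per segment.
    """
    clusters = [[i] for i in range(len(similarity))]

    def linkage(a, b):
        # Average linkage: mean similarity over all cross-cluster pairs (assumed).
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        score, p, q = max(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if score < threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(similarity)
    for cluster_id, members in enumerate(clusters):
        for segment in members:
            labels[segment] = cluster_id
    return labels


toy_similarity = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
speaker_labels = ahc_threshold(toy_similarity, threshold=0.5)
```

With this toy matrix the first two segments and the last two segments are merged into two speaker clusters, and merging stops because the remaining cross-cluster similarity is below the threshold.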

[54]  arXiv:2205.09710 (cross-list from cs.CL) [pdf, other]
Title: Voxel-informed Language Grounding
Comments: ACL 2022
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Natural language applied to natural 2D images describes a fundamentally 3D world. We present the Voxel-informed Language Grounder (VLG), a language grounding model that leverages 3D geometric information in the form of voxel maps derived from the visual input using a volumetric reconstruction model. We show that VLG significantly improves grounding accuracy on SNARE, an object reference game task. At the time of writing, VLG holds the top place on the SNARE leaderboard, achieving SOTA results with a 2.0% absolute improvement.

[55]  arXiv:2205.09747 (cross-list from cs.RO) [pdf, other]
Title: HandoverSim: A Simulation Framework and Benchmark for Human-to-Robot Object Handovers
Comments: Accepted to ICRA 2022
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

We introduce a new simulation benchmark "HandoverSim" for human-to-robot object handovers. To simulate the giver's motion, we leverage a recent motion capture dataset of hand grasping of objects. We create training and evaluation environments for the receiver with standardized protocols and metrics. We analyze the performance of a set of baselines and show a correlation with a real-world evaluation. Code is open sourced at https://handover-sim.github.io.

Replacements for Fri, 20 May 22

[56]  arXiv:2011.08641 (replaced) [pdf, other]
Title: A Review of Generalized Zero-Shot Learning Methods
Comments: 24 pages, 12 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[57]  arXiv:2011.12815 (replaced) [pdf, other]
Title: Learning Multiscale Convolutional Dictionaries for Image Reconstruction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[58]  arXiv:2012.01821 (replaced) [pdf, other]
Title: D-Unet: A Dual-encoder U-Net for Image Splicing Forgery Detection and Localization
Comments: 13 pages, 13 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[59]  arXiv:2104.02230 (replaced) [pdf, other]
Title: Achieving Domain Generalization in Underwater Object Detection by Domain Mixup and Contrastive Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[60]  arXiv:2106.02566 (replaced) [pdf, other]
Title: BR-NPA: A Non-Parametric High-Resolution Attention Model to improve the Interpretability of Attention
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[61]  arXiv:2108.03443 (replaced) [pdf, other]
Title: NODEO: A Neural Ordinary Differential Equation Based Optimization Framework for Deformable Image Registration
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[62]  arXiv:2110.03905 (replaced) [pdf]
Title: COVID-19 Monitoring System using Social Distancing and Face Mask Detection on Surveillance video datasets
Journal-ref: 2021 International Conference on Emerging Smart Computing and Informatics (ESCI), 2021, pp. 449-455
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[63]  arXiv:2110.05706 (replaced) [pdf]
Title: Deep Fusion Prior for Multi-Focus Image Super Resolution Fusion
Comments: 21 pages, 9 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[64]  arXiv:2112.04564 (replaced) [pdf, other]
Title: CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning
Comments: Published at CVPR 2022 as a conference paper. Code at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[65]  arXiv:2201.05314 (replaced) [pdf, other]
Title: A Novel Skeleton-Based Human Activity Discovery Using Particle Swarm Optimization with Gaussian Mutation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
[66]  arXiv:2201.10737 (replaced) [pdf, other]
Title: Class-Aware Generative Adversarial Transformers for Medical Image Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[67]  arXiv:2202.04533 (replaced) [pdf, other]
Title: NIMBLE: A Non-rigid Hand Model with Bones and Muscles
Comments: 16 pages, 18 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
[68]  arXiv:2202.11094 (replaced) [pdf, other]
Title: GroupViT: Semantic Segmentation Emerges from Text Supervision
Comments: CVPR 2022. Project page and code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[69]  arXiv:2203.09581 (replaced) [pdf, other]
Title: SepTr: Separable Transformer for Audio Spectrogram Processing
Comments: Submitted to INTERSPEECH 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[70]  arXiv:2204.02964 (replaced) [pdf, other]
Title: Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection
Comments: v2: more analysis & stronger results. Preprint. Work in progress. Code and pre-trained models are available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[71]  arXiv:2204.07953 (replaced) [pdf, other]
Title: Learning with Signatures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[72]  arXiv:2204.11887 (replaced) [pdf, other]
Title: Evolutionary latent space search for driving human portrait generation
Comments: This paper was accepted and presented during the 2021 IEEE Latin American Conference on Computational Intelligence (LA-CCI)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
[73]  arXiv:2205.01920 (replaced) [pdf, other]
Title: Scene Clustering Based Pseudo-labeling Strategy for Multi-modal Aerial View Object Classification
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[74]  arXiv:2205.02162 (replaced) [pdf, other]
Title: UnrealNAS: Can We Search Neural Architectures with Unreal Data?
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[75]  arXiv:2205.03146 (replaced) [pdf, other]
Title: CLIP-CLOP: CLIP-Guided Collage and Photomontage
Comments: 5 pages, 7 figures, accepted at the International Conference on Computational Creativity (ICCC) 2022 as Short Paper: Demo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[76]  arXiv:2205.03892 (replaced) [pdf, other]
Title: ConvMAE: Masked Convolution Meets Masked Autoencoders
Comments: 10 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[77]  arXiv:2205.04042 (replaced) [pdf, other]
Title: Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning
Comments: 11 pages, 2 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[78]  arXiv:2205.04183 (replaced) [pdf, other]
Title: Attracting and Dispersing: A Simple Approach for Source-free Domain Adaptation
Comments: Update the hyperparameter section
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[79]  arXiv:2205.06427 (replaced) [pdf, other]
Title: Test-time Fourier Style Calibration for Domain Generalization
Comments: 31st International Joint Conference on Artificial Intelligence (IJCAI) 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[80]  arXiv:2205.07403 (replaced) [pdf, other]
Title: PillarNet: Real-Time and High-Performance Pillar-based 3D Object Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[81]  arXiv:2205.08565 (replaced) [pdf, other]
Title: Text Detection & Recognition in the Wild for Robot Localization
Comments: 6 pages, VI sections, typos corrected, revision changes, no result changes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[82]  arXiv:2205.08706 (replaced) [pdf, other]
Title: SemiCurv: Semi-Supervised Curvilinear Structure Segmentation
Comments: IEEE Transactions on Image Processing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[83]  arXiv:2205.08924 (replaced) [pdf, other]
Title: Financial Time Series Data Augmentation with Generative Adversarial Networks and Extended Intertemporal Return Plots
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[84]  arXiv:2105.10699 (replaced) [pdf, other]
Title: Denoising Noisy Neural Networks: A Bayesian Approach with Compensation
Comments: Keywords: Noisy neural network, denoiser, wireless transmission of neural networks, federated edge learning, analog device. 18 pages, 9 figures
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Signal Processing (eess.SP)
[85]  arXiv:2110.09473 (replaced) [src]
Title: DBSegment: Fast and robust segmentation of deep brain structures -- Evaluation of transportability across acquisition domains
Comments: The data used have mistakes. No one has time to correct the data and add a new version, that is why we would like to retract it. Once we have the correct version we will resubmit to arxiv
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
[86]  arXiv:2111.08940 (replaced) [pdf, other]
Title: Transparent Human Evaluation for Image Captioning
Comments: Proc. of NAACL 2022
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
[87]  arXiv:2111.09212 (replaced) [pdf, other]
Title: Single-pass Object-adaptive Data Undersampling and Reconstruction for MRI
Journal-ref: in IEEE Transactions on Computational Imaging, vol. 8, pp. 333-345, 2022
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
[88]  arXiv:2202.05267 (replaced) [pdf, other]
Title: On Real-time Image Reconstruction with Neural Networks for MRI-guided Radiotherapy
Comments: 12 pages, 6 figures, 1 table. v2 has a typo in eqn 1 corrected and references added to the discussion
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[89]  arXiv:2203.17066 (replaced) [pdf, other]
Title: Cross-modal Learning of Graph Representations using Radar Point Cloud for Long-Range Gesture Recognition
Comments: Accepted by IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM 2022)
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[90]  arXiv:2205.00199 (replaced) [pdf, other]
Title: Cracking White-box DNN Watermarks via Invariant Neuron Transforms
Comments: in submission; a preprint version
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[91]  arXiv:2205.02152 (replaced) [pdf, other]
Title: Evaluating Transferability for Covid 3D Localization Using CT SARS-CoV-2 segmentation models
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[92]  arXiv:2205.05874 (replaced) [pdf, other]
Title: Distinction Maximization Loss: Efficiently Improving Classification Accuracy, Uncertainty Estimation, and Out-of-Distribution Detection Simply Replacing the Loss and Calibrating
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)