Computer Vision and Pattern Recognition

New submissions for Tue, 19 Jan 21

[1]  arXiv:2101.06278 [pdf, other]
Title: Catching Out-of-Context Misinformation with Self-supervised Learning
Comments: Video : this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Despite the recent attention to DeepFakes and other forms of image manipulations, one of the most prevalent ways to mislead audiences is the use of unaltered images in a new but false context. To address these challenges and support fact-checkers, we propose a new method that automatically detects out-of-context image and text pairs. Our core idea is a self-supervised training strategy where we only need images with matching (and non-matching) captions from different sources. At training time, our method learns to selectively align individual objects in an image with textual claims, without explicit supervision. At test time, we check for a given text pair whether both texts correspond to the same object(s) in the image but semantically convey different descriptions, which allows us to make fairly accurate out-of-context predictions. Our method achieves 82% out-of-context detection accuracy. To facilitate training our method, we created a large-scale dataset of 203,570 images which we match with 456,305 textual captions from a variety of news websites, blogs, and social media posts; i.e., for each image, we obtained several captions.

[2]  arXiv:2101.06310 [pdf, other]
Title: Automated Diagnosis of Intestinal Parasites: A new hybrid approach and its benefits
Comments: 18 pages, 11 figures
Journal-ref: Computers in Biology and Medicine, Volume 123, August 2020, 103917
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Intestinal parasites are responsible for several diseases in human beings. In order to eliminate the error-prone visual analysis of optical microscopy slides, we have investigated automated, fast, and low-cost systems for the diagnosis of human intestinal parasites. In this work, we present a hybrid approach that combines the opinion of two decision-making systems with complementary properties: ($DS_1$) a simpler system based on very fast handcrafted image feature extraction and support vector machine classification and ($DS_2$) a more complex system based on a deep neural network, VGG-16, for image feature extraction and classification. $DS_1$ is much faster than $DS_2$, but it is less accurate than $DS_2$. Fortunately, the errors of $DS_1$ are not the same as those of $DS_2$. During training, we use a validation set to learn the probabilities of misclassification by $DS_1$ on each class based on its confidence values. When $DS_1$ quickly classifies all images from a microscopy slide, the method selects a number of images with higher chances of misclassification for characterization and reclassification by $DS_2$. Our hybrid system can improve the overall effectiveness without compromising efficiency, being suitable for the clinical routine -- a strategy that might be suitable for other real applications. As demonstrated on large datasets, the proposed system can achieve, on average, 94.9%, 87.8%, and 92.5% of Cohen's Kappa on helminth eggs, helminth larvae, and protozoa cysts, respectively.
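
The selection step described in this abstract can be sketched as a two-stage cascade. The following is a minimal illustration, not the authors' implementation: the function name `hybrid_classify`, the `budget` parameter, and the three callables are hypothetical stand-ins for $DS_1$, $DS_2$, and the misclassification probabilities learned on the validation set.

```python
from typing import Callable, List, Sequence, Tuple


def hybrid_classify(
    images: Sequence[object],
    fast_classifier: Callable[[object], Tuple[str, float]],
    slow_classifier: Callable[[object], str],
    error_probability: Callable[[str, float], float],
    budget: int,
) -> List[str]:
    """Label every image with the fast system, then re-check the riskiest ones.

    All arguments are hypothetical placeholders: `fast_classifier` plays the
    role of DS_1 (returns a label and a confidence), `slow_classifier` plays
    the role of DS_2, and `error_probability` maps (label, confidence) to an
    estimated chance that the fast label is wrong.
    """
    # Step 1: the fast system labels everything and reports a confidence.
    fast_results = [fast_classifier(img) for img in images]
    labels = [label for label, _ in fast_results]

    # Step 2: estimate, per image, how likely the fast label is wrong.
    risks = [error_probability(label, conf) for label, conf in fast_results]

    # Step 3: only the `budget` riskiest images go to the slow system.
    by_risk = sorted(range(len(labels)), key=lambda i: risks[i], reverse=True)
    for i in by_risk[:budget]:
        labels[i] = slow_classifier(images[i])
    return labels
```

With a small `budget`, almost all images are handled by the fast system alone, which is how a cascade of this kind can keep the overall running time close to that of the fast system.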

[3]  arXiv:2101.06333 [pdf, other]
Title: Optical Flow Estimation via Motion Feature Recovery
Comments: 5 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Optical flow estimation with occlusion or large displacement is challenging due to the loss of corresponding pixels between consecutive frames. In this paper, we discover that the lost information is related to the fact that a large quantity of motion features (more than 40%) computed from the popular discriminative cost-volume feature completely vanish due to invalid sampling, leading to low efficiency of optical flow learning. We call this phenomenon the Vanishing Cost Volume Problem. Inspired by the fact that local motion tends to be highly consistent within a short temporal window, we propose a novel iterative Motion Feature Recovery (MFR) method to address the vanishing cost volume via modeling motion consistency across multiple frames. In each MFR iteration, invalid entries from the original motion features are first determined based on the current flow. Then, an efficient network is designed to adaptively learn the motion correlation to recover invalid features for lost-information restoration. The final optical flow is then decoded from the recovered motion features. Experimental results on Sintel and KITTI show that our method achieves state-of-the-art performance. In fact, MFR currently ranks second on the Sintel public website.
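
To make the idea of invalid sampling concrete, the sketch below counts how many cost-volume lookups fall outside the second frame for a given flow field. It rests on an assumption: the abstract does not define invalid sampling, so "invalid" is taken here to mean an out-of-bounds lookup. The function name `invalid_lookup_fraction` and the integer flow representation are hypothetical.

```python
from typing import List, Tuple


def invalid_lookup_fraction(flow: List[List[Tuple[int, int]]], radius: int) -> float:
    """Fraction of local cost-volume lookups that land outside the image.

    `flow[y][x]` is an integer displacement (dx, dy) for pixel (x, y).
    Each pixel looks up a (2 * radius + 1)^2 window centred on its
    displaced position in the second frame. Treating out-of-bounds
    lookups as "invalid" is an assumption made for this illustration.
    """
    height = len(flow)
    width = len(flow[0])
    total = 0
    invalid = 0
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            for oy in range(-radius, radius + 1):
                for ox in range(-radius, radius + 1):
                    tx, ty = x + dx + ox, y + dy + oy
                    total += 1
                    if not (0 <= tx < width and 0 <= ty < height):
                        invalid += 1
    return invalid / total
```

Under this assumption, larger displacements push more lookups outside the frame, so the fraction of unusable entries grows with the magnitude of the motion.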

[4]  arXiv:2101.06381 [pdf, other]
Title: Diversified Patch-based Style Transfer with Shifted Style Normalization
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Gram-based and patch-based approaches are two important research lines of image style transfer. Recent diversified Gram-based methods have been able to produce multiple and diverse reasonable solutions for the same content and style inputs. However, as another popular research interest, the diversity of patch-based methods remains challenging due to the stereotyped style swapping process based on nearest patch matching. To resolve this dilemma, in this paper, we dive into the core style swapping process of patch-based style transfer and explore possible ways to diversify it. What stands out is an operation called shifted style normalization (SSN), the most effective and efficient way to empower existing patch-based methods to generate diverse results for arbitrary styles. The key insight is to use an important intuition that neural patches with higher activation values could contribute more to diversity. Theoretical analyses and extensive experiments are conducted to demonstrate the effectiveness of our method, and compared with other possible options and state-of-the-art algorithms, it shows remarkable superiority in both diversity and efficiency.

[5]  arXiv:2101.06390 [pdf, other]
Title: GridTracer: Automatic Mapping of Power Grids using Deep Learning and Overhead Imagery
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Energy system information valuable for electricity access planning, such as the locations and connectivity of electricity transmission and distribution towers (termed the power grid), is often incomplete, outdated, or altogether unavailable. Furthermore, conventional means for collecting this information are costly and limited. We propose to automatically map the grid in overhead remotely sensed imagery using deep learning. Towards this goal, we develop and publicly release a large dataset ($263km^2$) of overhead imagery with ground truth for the power grid; to our knowledge, this is the first dataset of its kind in the public domain. Additionally, we propose scoring metrics and baseline algorithms for two grid mapping tasks: (1) tower recognition and (2) power line interconnection (i.e., estimating a graph representation of the grid). We hope the availability of the training data, scoring metrics, and baselines will facilitate rapid progress on this important problem to help decision-makers address the energy needs of societies around the world.

[6]  arXiv:2101.06391 [pdf, other]
Title: Unsupervised Noisy Tracklet Person Re-identification
Comments: was submitted to ICCV2019
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Existing person re-identification (re-id) methods mostly rely on supervised model learning from a large set of person identity labelled training data per domain. This limits their scalability and usability in large scale deployments. In this work, we present a novel selective tracklet learning (STL) approach that can train discriminative person re-id models from unlabelled tracklet data in an unsupervised manner. This avoids the tedious and costly process of exhaustively labelling person image/tracklet true matching pairs across camera views. Importantly, our method is particularly robust against arbitrary noisy data in raw tracklets and is therefore scalable to learning discriminative models from unconstrained tracking data. This differs from a handful of existing alternative methods that often assume the existence of true matches and balanced tracklet samples per identity class. This is achieved by formulating a data adaptive image-to-tracklet selective matching loss function explored in a multi-camera multi-task deep learning model structure. Extensive comparative experiments demonstrate that the proposed STL model significantly surpasses the state-of-the-art unsupervised learning and one-shot learning re-id methods on three large tracklet person re-id benchmarks.

[7]  arXiv:2101.06393 [pdf, other]
Title: Real Time Incremental Foveal Texture Mapping for Autonomous Vehicles
Comments: 8 Pages, 10 Figures, 2 Tables. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

We propose an end-to-end real-time framework to generate a high-resolution, graphics-grade textured 3D map of an urban environment. The generated detailed map finds its application in the precise localization and navigation of autonomous vehicles. It can also serve as a virtual test bed for various vision and planning algorithms as well as a background map in computer games. In this paper, we focus on two important issues: (i) incrementally generating a map with a coherent 3D surface in real time and (ii) preserving the quality of color texture. To handle the above issues, we first perform a pose-refinement procedure which leverages camera image information, Delaunay triangulation and existing scan matching techniques to produce a high-resolution 3D map from the sparse input LIDAR scan. This 3D map is then texturized and accumulated by using a novel technique of ray-filtering which handles occlusion and inconsistencies in pose-refinement. Further, inspired by the human fovea, we introduce foveal-processing which significantly reduces the computation time and also assists ray-filtering to maintain consistency in color texture and coherency in the 3D surface of the output map. Moreover, we also introduce texture error (TE) and mean texture mapping error (MTME), which provide quantitative measures of texturing and overall quality of the textured maps.

[8]  arXiv:2101.06399 [pdf, other]
Title: Latent Variable Models for Visual Question Answering
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Conventional models for Visual Question Answering (VQA) explore deterministic approaches with various types of image features, question features, and attention mechanisms. However, there exist other modalities that can be explored in addition to image and question pairs to bring extra information to the models. In this work, we propose latent variable models for VQA where extra information (e.g. captions and answer categories) is incorporated as latent variables to improve inference, which in turn benefits question-answering performance. Experiments on the VQA v2.0 benchmarking dataset demonstrate the effectiveness of our proposed models in that they improve over strong baselines, especially those that do not rely on extensive language-vision pre-training.

[9]  arXiv:2101.06405 [pdf, other]
Title: Semi Supervised Deep Quick Instance Detection and Segmentation
Comments: 7 Pages, 7 Figures, 5 Tables. 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

In this paper, we present a semi-supervised deep quick learning framework for instance detection and pixel-wise semantic segmentation of images in a dense clutter of items. The framework can quickly and incrementally learn novel items in an online manner by real-time data acquisition and generating corresponding ground truths on its own. To learn various combinations of items, it can synthesize cluttered scenes in real time. The overall approach is based on the tutor-child analogy in which a deep network (tutor) is pretrained for class-agnostic object detection and generates labeled data for another deep network (child). The child utilizes a customized convolutional neural network head for the purpose of quick learning. There are broadly four key components of the proposed framework: semi-supervised labeling, occlusion-aware clutter synthesis, a customized convolutional neural network head, and instance detection. The initial version of this framework was implemented during our participation in the Amazon Robotics Challenge (ARC), 2017. Our system was ranked 3rd, 4th and 5th worldwide in the pick, stow-pick and stow tasks, respectively. The proposed framework is an improved version over ARC17, where novel features such as instance detection and online learning have been added.

[10]  arXiv:2101.06407 [pdf, ps, other]
Title: ACP: Automatic Channel Pruning via Clustering and Swarm Intelligence Optimization for CNN
Comments: 13 pages, 9 figures, 10 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

As convolutional neural networks (CNNs) have become deeper and wider in recent years, the requirements for the amount of data and hardware resources have gradually increased. Meanwhile, CNNs also reveal salient redundancy in several tasks. The existing magnitude-based pruning methods are efficient, but the performance of the compressed network is unpredictable. While the accuracy loss after pruning based on the structure sensitivity is relatively slight, the process is time-consuming and the algorithm complexity is notable. In this article, we propose a novel automatic channel pruning method (ACP). Specifically, we first perform layer-wise channel clustering via the similarity of the feature maps to perform preliminary pruning on the network. Then a population initialization method is introduced to transform the pruned structure into a candidate population. Finally, we conduct searching and optimizing iteratively based on particle swarm optimization (PSO) to find the optimal compressed structure. The compact network is then retrained to mitigate the accuracy loss from pruning. Our method is evaluated against several state-of-the-art CNNs on three different classification datasets: CIFAR-10/100 and ILSVRC-2012. On ILSVRC-2012, when removing 64.36% of the parameters and 63.34% of the floating-point operations (FLOPs) of ResNet-50, the Top-1 and Top-5 accuracy drops are less than 0.9%. Moreover, we demonstrate that without harming overall performance it is possible to compress SSD by more than 50% on the target detection dataset PASCAL VOC. This further verifies that the proposed method can also be applied to other CNNs and application scenarios.

[11]  arXiv:2101.06409 [pdf, other]
Title: Shape Back-Projection In 3D Scenes
Comments: 7 pages, 7 figures, 3 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

In this work, we propose a novel framework, shape back-projection, for computationally efficient point cloud processing in a probabilistic manner. The primary components of the technique are a shape histogram and a back-projection procedure. The technique measures similarity between 3D surfaces by analyzing their geometrical properties. It is analogous to color back-projection, which measures similarity between images simply by looking at their color distributions. In the overall process, the shape histogram of a sample surface (e.g. planar) is first computed, which captures the profile of surface normals around a point in the form of a probability distribution. Later, the histogram is back-projected onto a test surface and a likelihood score is obtained. The score indicates how likely a point in the test surface is to behave similarly to the sample surface, geometrically. Shape back-projection finds its application in binary surface classification, high curvature edge detection in unorganized point clouds, automated point cloud labeling for 3D-CNNs (convolutional neural networks), etc. The algorithm can also be used for real-time robotic operations such as autonomous object picking in warehouse automation and ground plane extraction for autonomous vehicles, and can be deployed easily on computationally limited platforms (UAVs).
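
The two stages named in this abstract, histogram construction and back-projection, can be illustrated with a minimal sketch. It assumes that each point has already been reduced to one scalar feature in the range [0, pi], for example an angle derived from neighbouring surface normals; the abstract does not specify the feature, and the names `shape_histogram` and `back_project` are hypothetical.

```python
import math
from typing import List, Sequence


def _bin_index(feature: float, bins: int, upper: float) -> int:
    # Map a feature in [0, upper] to a bin; the top edge falls in the last bin.
    return min(int(feature / upper * bins), bins - 1)


def shape_histogram(features: Sequence[float], bins: int = 8,
                    upper: float = math.pi) -> List[float]:
    """Normalised histogram (a probability distribution) of per-point features."""
    if not features:
        raise ValueError("at least one feature is required")
    counts = [0] * bins
    for f in features:
        counts[_bin_index(f, bins, upper)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def back_project(histogram: Sequence[float], features: Sequence[float],
                 upper: float = math.pi) -> List[float]:
    """Score each test point by the probability mass of the bin it falls in."""
    bins = len(histogram)
    return [histogram[_bin_index(f, bins, upper)] for f in features]
```

As in color back-projection, a test point scores high when its feature value was common on the sample surface and low when it was rare.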

[12]  arXiv:2101.06411 [pdf, other]
Title: DeepMI: A Mutual Information Based Framework For Unsupervised Deep Learning of Tasks
Comments: 10 pages, 1 figure, 2 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

In this work, we propose an information theory based framework, DeepMI, to train deep neural networks (DNNs) using Mutual Information (MI). The DeepMI framework is especially targeted at, but not limited to, the learning of real world tasks in an unsupervised manner. The primary motivation behind this work is the insufficiency of traditional loss functions for unsupervised task learning. Moreover, directly using MI for training is quite challenging because MI is unbounded above. Hence, we develop an alternative linearized representation of MI as a part of the framework. The contributions of this paper are threefold: i) an investigation of MI to train deep neural networks, ii) a novel loss function $L_{LMI}$, and iii) a fuzzy logic based end-to-end differentiable pipeline to integrate DeepMI into a deep learning framework. We choose a few unsupervised learning tasks for our experimental study. We demonstrate that $L_{LMI}$ alone provides better gradients and achieves better neural network performance than the cases where multiple loss functions are used for a given task.
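
For readers unfamiliar with the quantity, the sketch below computes the standard mutual information of two discrete variables from a joint probability table. It illustrates only the textbook definition, not the linearized loss proposed in the paper, whose form is not given in the abstract. The function name `mutual_information` is hypothetical.

```python
import math
from typing import Sequence


def mutual_information(joint: Sequence[Sequence[float]]) -> float:
    """I(X; Y) in nats for a joint table where joint[i][j] = p(X=i, Y=j)."""
    # Marginals: sum over rows for p(x), over columns for p(y).
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:  # zero-probability cells contribute nothing
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi
```

Independent variables give zero, and two perfectly correlated binary variables give log 2 nats.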

[13]  arXiv:2101.06438 [pdf, other]
Title: Adaptive Remote Sensing Image Attribute Learning for Active Object Detection
Comments: Accepted in 25th International Conference on Pattern Recognition (ICPR), (Milan, Italy), January 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In recent years, deep learning methods have brought incredible progress to the field of object detection. However, in the field of remote sensing image processing, existing methods neglect the relationship between imaging configuration and detection performance, and do not take into account the importance of detection performance feedback for improving image quality. Therefore, detection performance is limited by the passive nature of the conventional object detection framework. To address these limitations, this paper takes adaptive brightness adjustment and scale adjustment as examples, and proposes an active object detection method based on deep reinforcement learning. The goal of adaptive image attribute learning is to maximize the detection performance. With the help of active object detection and image attribute adjustment strategies, low-quality images can be converted into high-quality images, and the overall performance is improved without retraining the detector.

[14]  arXiv:2101.06462 [pdf, other]
Title: Dual-Level Collaborative Transformer for Image Captioning
Comments: AAAI 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Descriptive region features extracted by object detection networks have played an important role in the recent advancements of image captioning. However, they are still criticized for the lack of contextual information and fine-grained details, which in contrast are the merits of traditional grid features. In this paper, we introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features. Concretely, in DLCT, these two features are first processed by a novel Dual-way Self Attention (DWSA) to mine their intrinsic properties, where a Comprehensive Relation Attention component is also introduced to embed the geometric information. In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noises caused by the direct fusion of these two features, where a geometric alignment graph is constructed to accurately align and reinforce region and grid features. To validate our model, we conduct extensive experiments on the highly competitive MS-COCO dataset, and achieve new state-of-the-art performance on both local and online test sets, i.e., 133.8% CIDEr-D on Karpathy split and 135.4% CIDEr on the official split. Code is available at https://github.com/luo3300612/image-captioning-DLCT.

[15]  arXiv:2101.06498 [pdf, other]
Title: Bladder segmentation based on deep learning approaches: current limitations and lessons
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Precise determination and assessment of the extent of muscle invasion in bladder cancer (BC) guides proper risk stratification and personalized therapy selection. In this context, segmentation of both bladder walls and cancer is of pivotal importance, as it provides invaluable information to stage the primary tumour. Hence, multi region segmentation on patients presenting with symptoms of bladder tumours using deep learning heralds a new level of staging accuracy and prediction of the biologic behaviour of the tumour. Nevertheless, despite the success of these models in other medical problems, progress in multi region bladder segmentation is still at a nascent stage, with just a handful of works tackling a multi region scenario. Furthermore, most existing approaches systematically follow prior literature in other clinical problems, without questioning the validity of these methods on bladder segmentation, which may present different challenges. Inspired by this, we provide an in-depth look at bladder cancer segmentation using deep learning models. The critical determinants for accurate differentiation of muscle invasive disease, the current status of deep learning based bladder segmentation, and the lessons and limitations of prior work are highlighted.

[16]  arXiv:2101.06541 [pdf, other]
Title: SceneGen: Learning to Generate Realistic Traffic Scenes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

We consider the problem of generating realistic traffic scenes automatically. Existing methods typically insert actors into the scene according to a set of hand-crafted heuristics and are limited in their ability to model the true complexity and diversity of real traffic scenes, thus inducing a content gap between synthesized traffic scenes and real ones. As a result, existing simulators lack the fidelity necessary to train and test self-driving vehicles. To address this limitation, we present SceneGen, a neural autoregressive model of traffic scenes that eschews the need for rules and heuristics. In particular, given the ego-vehicle state and a high definition map of the surrounding area, SceneGen inserts actors of various classes into the scene and synthesizes their sizes, orientations, and velocities. We demonstrate on two large-scale datasets SceneGen's ability to faithfully model distributions of real traffic scenes. Moreover, we show that SceneGen coupled with sensor simulation can be used to train perception models that generalize to the real world.

[17]  arXiv:2101.06543 [pdf, other]
Title: GeoSim: Photorealistic Image Simulation with Geometry-Aware Composition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Robotics (cs.RO)

Scalable sensor simulation is an important yet challenging open problem for safety-critical domains such as self-driving. Current work in image simulation either fails to be photorealistic or does not model the 3D environment and the dynamic objects within, losing high-level control and physical realism. In this paper, we present GeoSim, a geometry-aware image composition process that synthesizes novel urban driving scenes by augmenting existing images with dynamic objects extracted from other scenes and rendered at novel poses. Towards this goal, we first build a diverse bank of 3D objects with both realistic geometry and appearance from sensor data. During simulation, we perform a novel geometry-aware simulation-by-composition procedure which 1) proposes plausible and realistic object placements into a given scene, 2) renders novel views of dynamic objects from the asset bank, and 3) composes and blends the rendered image segments. The resulting synthetic images are photorealistic, traffic-aware, and geometrically consistent, allowing image simulation to scale to complex use cases. We demonstrate two such important applications: long-range realistic video simulation across multiple camera sensors, and synthetic data generation for data augmentation on downstream segmentation tasks.

[18]  arXiv:2101.06545 [pdf, other]
Title: VideoClick: Video Object Segmentation with a Single Click
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Annotating videos with object segmentation masks typically involves a two stage procedure of drawing polygons per object instance for all the frames and then linking them through time. While simple, this is a very tedious, time consuming and expensive process, making the creation of accurate annotations at scale only possible for well-funded labs. What if we were able to segment an object in the full video with only a single click? This will enable video segmentation at scale with a very low budget opening the door to many applications. Towards this goal, in this paper we propose a bottom up approach where given a single click for each object in a video, we obtain the segmentation masks of these objects in the full video. In particular, we construct a correlation volume that assigns each pixel in a target frame to either one of the objects in the reference frame or the background. We then refine this correlation volume via a recurrent attention module and decode the final segmentation. To evaluate the performance, we label the popular and challenging Cityscapes dataset with video object segmentations. Results on this new CityscapesVideo dataset show that our approach outperforms all the baselines in this challenging setting.

[19]  arXiv:2101.06553 [pdf, other]
Title: Self-Supervised Representation Learning from Flow Equivariance
Comments: tech report
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Self-supervised representation learning is able to learn semantically meaningful features; however, much of its recent success relies on multiple crops of an image with very few objects. Instead of learning view-invariant representation from simple images, humans learn representations in a complex world with changing scenes by observing object movement, deformation, pose variation, and ego motion. Motivated by this ability, we present a new self-supervised learning representation framework that can be directly deployed on a video stream of complex scenes with many moving objects. Our framework features a simple flow equivariance objective that encourages the network to predict the features of another frame by applying a flow transformation to the features of the current frame. Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images. Readout experiments on challenging semantic segmentation, instance segmentation, and object detection benchmarks show that we are able to outperform representations obtained from previous state-of-the-art methods including SimCLR and BYOL.
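
A flow equivariance objective of the kind described in this abstract can be illustrated on a one-dimensional toy example. This is a sketch under stated assumptions, not the paper's loss: it uses integer backward flow, a mean squared error, and skips positions whose source lies outside the frame. The names `warp_1d` and `equivariance_loss` are hypothetical.

```python
from typing import List, Optional, Sequence


def warp_1d(features: Sequence[float],
            backward_flow: Sequence[int]) -> List[Optional[float]]:
    """For each target position i, sample the source at i + backward_flow[i].

    Positions whose source index is out of range are returned as None.
    """
    n = len(features)
    warped: List[Optional[float]] = []
    for i in range(n):
        src = i + backward_flow[i]
        warped.append(features[src] if 0 <= src < n else None)
    return warped


def equivariance_loss(feat_t: Sequence[float], feat_next: Sequence[float],
                      backward_flow: Sequence[int]) -> float:
    """Mean squared error between warped current features and next-frame features."""
    warped = warp_1d(feat_t, backward_flow)
    errors = [(w - f) ** 2 for w, f in zip(warped, feat_next) if w is not None]
    return sum(errors) / len(errors) if errors else 0.0
```

When the flow correctly describes how the scene moved between frames, the warped features match the next frame and the loss is zero; a wrong flow leaves a residual.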

[20]  arXiv:2101.06571 [pdf, other]
Title: S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Constructing and animating humans is an important component for building virtual worlds in a wide variety of applications such as virtual reality or robotics testing in simulation. As there are exponentially many variations of humans with different shape, pose and clothing, it is critical to develop methods that can automatically reconstruct and animate humans at scale from real world data. Towards this goal, we represent the pedestrian's shape, pose and skinning weights as neural implicit functions that are directly learned from data. This representation enables us to handle a wide variety of different pedestrian shapes and poses without explicitly fitting a human parametric body model, allowing us to handle a wider range of human geometries and topologies. We demonstrate the effectiveness of our approach on various datasets and show that our reconstructions outperform existing state-of-the-art methods. Furthermore, our re-animation experiments show that we can generate 3D human animations at scale from a single RGB image (and/or an optional LiDAR sweep) as input.

[21]  arXiv:2101.06586 [pdf, other]
Title: Auto4D: Learning to Label 4D Objects from Sequential Point Clouds
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In the past few years we have seen great advances in 3D object detection thanks to deep learning methods. However, they typically rely on large amounts of high-quality labels to achieve good performance, which often require time-consuming and expensive work by human annotators. To address this we propose an automatic annotation pipeline that generates accurate object trajectories in 3D (i.e., 4D labels) from LiDAR point clouds. Different from previous works that consider single frames at a time, our approach directly operates on sequential point clouds to combine richer object observations. The key idea is to decompose the 4D label into two parts: the 3D size of the object, and its motion path describing the evolution of the object's pose through time. More specifically, given a noisy but easy-to-get object track as initialization, our model first estimates the object size from temporally aggregated observations, and then refines its motion path by considering both frame-wise observations and temporal motion cues. We validate the proposed method on a large-scale driving dataset and show that our approach achieves significant improvements over the baselines. We also showcase the benefits of our approach under the annotator-in-the-loop setting.

[22]  arXiv:2101.06594 [pdf, other]
Title: PLUME: Efficient 3D Object Detection from Stereo Images
Subjects: Computer Vision and Pattern Recognition (cs.CV)

3D object detection plays a significant role in various robotic applications including self-driving. While many approaches rely on expensive 3D sensors like LiDAR to produce accurate 3D estimates, stereo-based methods have recently shown promising results at a lower cost. Existing methods tackle the problem in two steps: first depth estimation is performed, a pseudo LiDAR point cloud representation is computed from the depth estimates, and then object detection is performed in 3D space. However, because the two separate tasks are optimized in different metric spaces, the depth estimation is biased towards big objects and may cause sub-optimal performance of 3D detection. In this paper we propose a model that unifies these two tasks in the same metric space for the first time. Specifically, our model directly constructs a pseudo LiDAR feature volume (PLUME) in 3D space, which is used to solve both occupancy estimation and object detection tasks. PLUME achieves state-of-the-art performance on the challenging KITTI benchmark, with significantly reduced inference time compared with existing methods.

[23]  arXiv:2101.06605 [pdf, other]
Title: MultiBodySync: Multi-Body Segmentation and Motion Estimation via 3D Scan Synchronization
Comments: Contact: huang-jh18<at>mails<dot>tsinghua<dot>edu<dot>cn
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

We present MultiBodySync, a novel, end-to-end trainable multi-body motion segmentation and rigid registration framework for multiple input 3D point clouds. The two non-trivial challenges posed by this multi-scan multibody setting that we investigate are: (i) guaranteeing correspondence and segmentation consistency across multiple input point clouds capturing different spatial arrangements of bodies or body parts; and (ii) obtaining robust motion-based rigid body segmentation applicable to novel object categories. We propose an approach to address these issues that incorporates spectral synchronization into an iterative deep declarative network, so as to simultaneously recover consistent correspondences as well as motion segmentation. At the same time, by explicitly disentangling the correspondence and motion segmentation estimation modules, we achieve strong generalizability across different object categories. Our extensive evaluations demonstrate that our method is effective on various datasets ranging from rigid parts in articulated objects to individually moving objects in a 3D scene, be it single-view or full point clouds.

[24]  arXiv:2101.06608 [pdf, other]
Title: Network Automatic Pruning: Start NAP and Take a Nap
Comments: An updated version of 'MLPrune: Multi-Layer Pruning for Automated Neural Network Compression'
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Network pruning can significantly reduce the computation and memory footprint of large neural networks. To achieve a good trade-off between model size and performance, popular pruning techniques usually rely on hand-crafted heuristics and require manually setting the compression ratio for each layer. This process is typically time-consuming and requires expert knowledge to achieve good results. In this paper, we propose NAP, a unified and automatic pruning framework for both fine-grained and structured pruning. It can identify unimportant components of a network and automatically decide appropriate compression ratios for different layers, based on a theoretically sound criterion. Towards this goal, NAP uses an efficient approximation of the Hessian for evaluating the importance of components, based on a Kronecker-factored Approximate Curvature method. Despite its simplicity, NAP outperforms previous pruning methods by large margins. For fine-grained pruning, NAP can compress AlexNet and VGG16 by 25x, and ResNet-50 by 6.7x without loss in accuracy on ImageNet. For structured pruning (e.g. channel pruning), it can reduce the FLOPs of VGG16 by 5.4x and ResNet-50 by 2.3x with only a 1% accuracy drop. More importantly, this method is almost free from hyper-parameter tuning and requires no expert knowledge. You can start NAP and then take a nap!
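
The idea of deriving per-layer compression ratios from one network-wide ranking can be illustrated with a minimal sketch. It assumes that a single global ranking of importance scores determines what is pruned; the scores below are made-up placeholders, not the Hessian-based criterion described in the abstract, and the function name is hypothetical.

```python
def per_layer_prune_ratios(importance, global_ratio):
    """Derive per-layer pruning ratios from one global ranking.

    importance: dict mapping a layer name to a list of importance
    scores (placeholder values, NOT the K-FAC based criterion).
    global_ratio: fraction of all components to prune network-wide.
    """
    all_scores = sorted(s for scores in importance.values() for s in scores)
    n_prune = int(len(all_scores) * global_ratio)
    if n_prune == 0:
        return {name: 0.0 for name in importance}
    # Everything at or below this score is pruned (ties may prune extra).
    threshold = all_scores[n_prune - 1]
    return {
        name: sum(1 for s in scores if s <= threshold) / len(scores)
        for name, scores in importance.items()
    }


# Hypothetical scores for two layers.
importance = {
    "conv1": [0.9, 0.8, 0.1, 0.6],
    "fc": [0.05, 0.01, 0.5, 0.02],
}
ratios = per_layer_prune_ratios(importance, 0.5)
print(ratios)
```

In this toy setting the layer with more low-importance components ends up with a higher pruning ratio, without anyone specifying a ratio per layer.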

[25]  arXiv:2101.06616 [pdf]
Title: A relic sketch extraction framework based on detail-aware hierarchical deep network
Subjects: Computer Vision and Pattern Recognition (cs.CV)

As the first step of the restoration process of painted relics, sketch extraction plays an important role in cultural research. However, sketch extraction suffers from serious disease corrosion, which results in broken lines and noise. To overcome these problems, we propose a deep learning-based hierarchical sketch extraction framework for painted cultural relics. We divide the sketch extraction process into two stages: coarse extraction and fine extraction. In the coarse extraction stage, we develop a novel detail-aware bi-directional cascade network that integrates flow-based difference-of-Gaussians (FDoG) edge detection and a bi-directional cascade network (BDCN) under a transfer learning framework. It not only uses pre-training to alleviate the need for large datasets in deep network training but also guides the network to learn detail characteristics from the prior knowledge provided by FDoG. For the fine extraction stage, we design a new multiscale U-Net (MSU-Net) to effectively remove disease noise and refine the sketch. Specifically, all the features extracted from multiple intermediate layers in the decoder of MSU-Net are fused for sketch prediction. Experimental results show that the proposed method outperforms seven other state-of-the-art methods in terms of visual and quantitative metrics and can also deal with complex backgrounds.

[26]  arXiv:2101.06634 [pdf, other]
Title: Regional Attention Network (RAN) for Head Pose and Fine-grained Gesture Recognition
Comments: This manuscript is the accepted version of the published paper in IEEE Transaction on Affective Computing
Journal-ref: IEEE Transaction on Affective Computing 2020
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Affect is often expressed via non-verbal body language such as actions/gestures, which are vital indicators for human behaviors. Recent studies on recognition of fine-grained actions/gestures in monocular images have mainly focused on modeling spatial configuration of body parts representing body pose, human-object interactions and variations in local appearance. The results show that this is a brittle approach since it relies on accurate body parts/objects detection. In this work, we argue that there exist local discriminative semantic regions, whose "informativeness" can be evaluated by the attention mechanism for inferring fine-grained gestures/actions. To this end, we propose a novel end-to-end Regional Attention Network (RAN), which is a fully Convolutional Neural Network (CNN) to combine multiple contextual regions through an attention mechanism, focusing on parts of the images that are most relevant to a given task. Our regions consist of one or more consecutive cells and are adapted from the strategies used in computing the HOG (Histogram of Oriented Gradient) descriptor. The model is extensively evaluated on ten datasets belonging to 3 different scenarios: 1) head pose recognition, 2) driver state recognition, and 3) human action and facial expression recognition. The proposed approach outperforms the state-of-the-art by a considerable margin in different metrics.

[27]  arXiv:2101.06635 [pdf, other]
Title: Context-aware Attentional Pooling (CAP) for Fine-grained Visual Classification
Comments: Extended version of the accepted paper in 35th AAAI Conference on Artificial Intelligence 2021
Journal-ref: 35th AAAI Conference on Artificial Intelligence 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Deep convolutional neural networks (CNNs) have shown a strong ability in mining discriminative object pose and parts information for image recognition. For fine-grained recognition, context-aware rich feature representation of object/scene plays a key role since it exhibits a significant variance in the same subcategory and subtle variance among different subcategories. Finding the subtle variance that fully characterizes the object/scene is not straightforward. To address this, we propose a novel context-aware attentional pooling (CAP) that effectively captures subtle changes via sub-pixel gradients, and learns to attend to informative integral regions and their importance in discriminating different subcategories without requiring bounding-box and/or distinguishable part annotations. We also introduce a novel feature encoding by considering the intrinsic consistency between the informativeness of the integral regions and their spatial structures to capture the semantic correlation among them. Our approach is simple yet extremely effective and can be easily applied on top of a standard classification backbone network. We evaluate our approach using six state-of-the-art (SotA) backbone networks and eight benchmark datasets. Our method significantly outperforms the SotA approaches on six datasets and is very competitive with the remaining two.

[28]  arXiv:2101.06636 [pdf, other]
Title: Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition
Comments: Extended version of the accepted WACV 2021
Journal-ref: Winter Conference on Applications of Computer Vision (WACV 2021)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

There has been significant progress in recognizing traditional human activities from videos, focusing on highly distinctive actions involving discriminative body movements, body-object and/or human-human interactions. Driver's activities are different since they are executed by the same subject with similar body part movements, resulting in subtle changes. To address this, we propose a novel framework that exploits spatiotemporal attention to model the subtle changes. Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse network. The goal is to allow the glimpse to capture high-level temporal relationships, such as 'during', 'before' and 'after' by focusing on a specific part of a video. These branches also respect the topology of the temporal dynamics in the video, ensuring that different branches learn meaningful spatial and temporal changes. The model then uses an innovative attention mechanism to generate high-level action specific contextual information for activity recognition by exploring the hidden states of an LSTM. The attention mechanism helps in learning to decide the importance of each hidden state for the recognition task by weighing them when constructing the representation of the video. Our approach is evaluated on four publicly accessible datasets and outperforms the state-of-the-art by a considerable margin with only RGB video as input.

[29]  arXiv:2101.06644 [pdf, other]
Title: HySTER: A Hybrid Spatio-Temporal Event Reasoner
Comments: Preprint accepted by the 35th AAAI Conference on Artificial Intelligence (AAAI-21) Workshop on Hybrid Artificial Intelligence (HAI)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

The task of Video Question Answering (VideoQA) consists in answering natural language questions about a video and serves as a proxy to evaluate the performance of a model in scene sequence understanding. Most methods designed for VideoQA to date are end-to-end deep learning architectures which struggle with complex temporal and causal reasoning and provide limited transparency in reasoning steps. We present HySTER, a Hybrid Spatio-Temporal Event Reasoner, to reason over physical events in videos. Our model combines the strength of deep learning methods in extracting information from video frames with the reasoning capabilities and explainability of symbolic artificial intelligence in an answer set programming framework. We define a method based on general temporal, causal and physics rules which can be transferred across tasks. We apply our model to the CLEVRER dataset and demonstrate state-of-the-art results in question answering accuracy. This work sets the foundations for the incorporation of inductive logic programming in the field of VideoQA.

[30]  arXiv:2101.06650 [pdf, other]
Title: Generalized Image Reconstruction over T-Algebra
Comments: 6 pages, 4 figures, 3 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Commutative Algebra (math.AC)

Principal Component Analysis (PCA) is well known for its capability of dimension reduction and data compression. However, when using PCA for compressing/reconstructing images, images need to be recast to vectors. The vectorization of images loses some correlation constraints among neighboring pixels as well as spatial information. To deal with the drawbacks of the vectorization adopted by PCA, we use small neighborhoods of each pixel to form compounded pixels and use a tensorial version of PCA, called TPCA (Tensorial Principal Component Analysis), to compress and reconstruct a compounded image of compounded pixels. Our experiments on public data show that TPCA compares favorably with PCA in compressing and reconstructing images. We also show in our experiments that the performance of TPCA increases when the order of compounded pixels increases.
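
The step of forming compounded pixels from small neighborhoods can be sketched as follows. This is only an illustration of the neighborhood-gathering idea: the square window, the edge-replication border handling, and the function name are assumptions, and the tensorial PCA step itself is not shown.

```python
def compound_pixels(image, radius=1):
    """Replace each pixel by its (2*radius+1)^2 neighborhood, flattened.

    image: 2D list of gray values. Borders are handled by replicating
    the nearest edge pixel (an assumption made for this sketch).
    """
    height, width = len(image), len(image[0])

    def clamp(value, upper):
        return max(0, min(value, upper))

    compounded = []
    for r in range(height):
        row = []
        for c in range(width):
            patch = [
                image[clamp(r + dr, height - 1)][clamp(c + dc, width - 1)]
                for dr in range(-radius, radius + 1)
                for dc in range(-radius, radius + 1)
            ]
            row.append(patch)
        compounded.append(row)
    return compounded


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
compounded = compound_pixels(image)
print(compounded[1][1])  # neighborhood of the center pixel
```

Each entry of the result keeps the spatial context of its pixel, which is the information that plain vectorization discards.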

[31]  arXiv:2101.06653 [pdf, other]
Title: LaneRCNN: Distributed Representations for Graph-Centric Motion Forecasting
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Forecasting the future behaviors of dynamic actors is an important task in many robotics applications such as self-driving. It is extremely challenging as actors have latent intentions and their trajectories are governed by complex interactions between the other actors, themselves, and the maps. In this paper, we propose LaneRCNN, a graph-centric motion forecasting model. Importantly, relying on a specially designed graph encoder, we learn a local lane graph representation per actor (LaneRoI) to encode its past motions and the local map topology. We further develop an interaction module which permits efficient message passing among local graph representations within a shared global lane graph. Moreover, we parameterize the output trajectories based on lane graphs, a more amenable prediction parameterization. Our LaneRCNN captures the actor-to-actor and the actor-to-map relations in a distributed and map-aware manner. We demonstrate the effectiveness of our approach on the large-scale Argoverse Motion Forecasting Benchmark. We achieve 1st place on the leaderboard and significantly outperform previous best results.

[32]  arXiv:2101.06658 [pdf, other]
Title: Trilevel Neural Architecture Search for Efficient Single Image Super-Resolution
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

This paper proposes a trilevel neural architecture search (NAS) method for efficient single image super-resolution (SR). For that, we first define the discrete search space at three levels, i.e., at network-level, cell-level, and kernel-level (convolution-kernel). For modeling the discrete search space, we apply a new continuous relaxation on the discrete search spaces to build a hierarchical mixture of network-path, cell-operations, and kernel-width. Then, an efficient search algorithm is proposed to perform optimization in a hierarchical supernet manner that provides a globally optimized and compressed network via joint convolution kernel width pruning, cell structure search, and network path optimization. Unlike current NAS methods, we exploit a sorted sparsestmax activation to let the three-level neural structures contribute sparsely. Consequently, our NAS optimization progressively converges to those neural structures with dominant contributions to the supernet. Additionally, our proposed optimization construction enables a simultaneous search and training in a single phase, which dramatically reduces search and training time compared to the traditional NAS algorithms. Experiments on the standard benchmark datasets demonstrate that our NAS algorithm provides SR models that are significantly lighter in terms of the number of parameters and FLOPS with PSNR value comparable to the current state-of-the-art.

[33]  arXiv:2101.06663 [pdf, other]
Title: Separable Batch Normalization for Robust Facial Landmark Localization with Cross-protocol Network Training
Comments: 10 pages,6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Big, diverse and balanced training data are the key to the success of deep neural network training. However, existing publicly available datasets used in facial landmark localization are usually much smaller than those for other computer vision tasks. A small dataset without diverse and balanced training samples cannot support the training of a deep network effectively. To address the above issues, this paper presents a novel Separable Batch Normalization (SepBN) module with a Cross-protocol Network Training (CNT) strategy for robust facial landmark localization. Different from the standard BN layer that uses all the training data to calculate a single set of parameters, SepBN considers that the samples of a training dataset may belong to different sub-domains. Accordingly, the proposed SepBN module uses multiple sets of parameters, each corresponding to a specific sub-domain. However, the selection of an appropriate branch in the inference stage remains a challenging task because the sub-domain of a test sample is unknown. To mitigate this difficulty, we propose a novel attention mechanism that assigns different weights to each branch for effective automatic selection. As a further innovation, the proposed CNT strategy trains a network using multiple datasets having different facial landmark annotation systems, boosting the performance and enhancing the generalization capacity of the trained network. The experimental results obtained on several well-known datasets demonstrate the effectiveness of the proposed method.
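
One way to picture a normalization layer with several parameter sets and attention-based branch weighting is the sketch below. It is a guess at the general shape of the idea, not the authors' implementation: the softmax weighting, the scalar features, and the `branch_logits` input (standing in for the output of an attention module) are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sep_bn(batch, gammas, betas, branch_logits, eps=1e-5):
    """Normalize a batch of scalars, then apply a weighted mix of
    several (gamma, beta) pairs, one pair per assumed sub-domain.

    branch_logits is a hypothetical stand-in for attention scores.
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    weights = softmax(branch_logits)
    gamma = sum(w * g for w, g in zip(weights, gammas))
    beta = sum(w * b for w, b in zip(weights, betas))
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# Two branches with equal attention: effective gamma = 2, beta = 1.
out = sep_bn([1.0, 3.0], gammas=[1.0, 3.0], betas=[0.0, 2.0],
             branch_logits=[0.0, 0.0])
print(out)
```

With a strongly peaked `branch_logits`, the layer behaves almost like a single selected branch, which matches the abstract's description of automatic branch selection.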

[34]  arXiv:2101.06679 [pdf, other]
Title: End-to-end Interpretable Neural Motion Planner
Comments: CVPR 2019 (Oral)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

In this paper, we propose a neural motion planner (NMP) for learning to drive autonomously in complex urban scenarios that include traffic-light handling, yielding, and interactions with multiple road-users. Towards this goal, we design a holistic model that takes as input raw LIDAR data and an HD map and produces interpretable intermediate representations in the form of 3D detections and their future trajectories, as well as a cost volume defining the goodness of each position that the self-driving car can take within the planning horizon. We then sample a set of diverse physically possible trajectories and choose the one with the minimum learned cost. Importantly, our cost volume is able to naturally capture multi-modality. We demonstrate the effectiveness of our approach in real-world driving data captured in several cities in North America. Our experiments show that the learned cost volume can generate safer planning than all the baselines.
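
The selection step (score sampled trajectories against a cost volume and keep the cheapest) can be sketched as below. The cost values are made up rather than learned, the grid indexing `cost_volume[t][y][x]` is an assumption, and trajectory sampling is replaced by a fixed candidate list.

```python
def trajectory_cost(cost_volume, trajectory):
    """Sum the cost of the cell visited at each time step.

    cost_volume[t][y][x] is the cost of being at cell (x, y) at step t.
    trajectory is a list of (x, y) cells, one per time step.
    """
    return sum(cost_volume[t][y][x] for t, (x, y) in enumerate(trajectory))


def plan(cost_volume, candidates):
    """Return the candidate trajectory with the minimum total cost."""
    return min(candidates, key=lambda traj: trajectory_cost(cost_volume, traj))


# Made-up 2-step cost volume over a 2x2 grid.
cost_volume = [
    [[1.0, 5.0], [2.0, 0.5]],  # t = 0
    [[3.0, 0.2], [4.0, 1.0]],  # t = 1
]
candidates = [
    [(0, 0), (1, 0)],
    [(1, 1), (1, 1)],
    [(0, 1), (0, 1)],
]
best = plan(cost_volume, candidates)
print(best, trajectory_cost(cost_volume, best))
```

Because the cost volume assigns a value to every position and time step, several distinct low-cost routes can coexist, which is one way to read the abstract's remark about multi-modality.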

[35]  arXiv:2101.06686 [pdf, other]
Title: KCP: Kernel Cluster Pruning for Dense Labeling Neural Networks
Comments: 17 pages, 16 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Pruning has become a promising technique used to compress and accelerate neural networks. Existing methods are mainly evaluated on sparse labeling applications. However, dense labeling applications are closer to real-world problems that require real-time processing on resource-constrained mobile devices. Pruning for dense labeling applications is still a largely unexplored field. The prevailing filter channel pruning method removes the entire filter channel. Accordingly, the interaction between each kernel in one filter channel is ignored.
In this study, we propose kernel cluster pruning (KCP) to prune dense labeling networks. We develop a clustering technique to identify the least representational kernels in each layer. By iteratively removing those kernels, the parameters that can better represent the entire network are preserved; thus, we achieve better accuracy with a decent model size and computation reduction. When evaluated on stereo matching and semantic segmentation neural networks, our method can reduce FLOPs by more than 70% with less than a 1% accuracy drop. Moreover, for ResNet-50 on ILSVRC-2012, our KCP can reduce FLOPs by more than 50% with a 0.13% Top-1 accuracy gain. Therefore, KCP achieves state-of-the-art pruning results.

[36]  arXiv:2101.06702 [pdf]
Title: Deep Learning based Virtual Point Tracking for Real-Time Target-less Dynamic Displacement Measurement in Railway Applications
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In the application of computer-vision based displacement measurement, an optical target is usually required to provide the reference. When the optical target cannot be attached to the measured object, edge detection, feature matching and template matching are the most common approaches in target-less photogrammetry. However, their performance relies heavily on parameter settings. This becomes problematic in dynamic scenes where complicated background texture exists and varies over time. To tackle this issue, we propose virtual point tracking for real-time target-less dynamic displacement measurement, incorporating deep learning techniques and domain knowledge. Our approach consists of three steps: 1) automatic calibration for detection of the region of interest; 2) virtual point detection for each video frame using a deep convolutional neural network; 3) a domain-knowledge based rule engine for point tracking in adjacent frames. The proposed approach can be executed on an edge computer in real time (i.e. over 30 frames per second). We demonstrate our approach for a railway application, where the lateral displacement of the wheel on the rail is measured during operation. We also implement an algorithm using template matching and line detection as the baseline for comparison. Numerical experiments have been performed to evaluate the performance and the latency of our approach in the harsh railway environment with noisy and varying backgrounds.

[37]  arXiv:2101.06709 [pdf]
Title: Human Activity Recognition Using Multichannel Convolutional Neural Network
Comments: 10 pages, Proceedings of the 2019 5th International Conference on Advances in Electrical Engineering (ICAEE), 26-28 September, Dhaka, Bangladesh
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Human Activity Recognition (HAR) simply refers to the capacity of a machine to perceive human actions. HAR is a prominent application of advanced Machine Learning and Artificial Intelligence techniques that utilize computer vision to understand the semantic meanings of heterogeneous human actions. This paper describes a supervised learning method that can distinguish human actions based on data collected from practical human movements. The primary challenge while working with HAR is to overcome the difficulties that come with the cyclostationary nature of the activity signals. This study proposes a HAR classification model based on a two-channel Convolutional Neural Network (CNN) that makes use of the frequency and power features of the collected human action signals. The model was tested on the UCI HAR dataset, which resulted in a 95.25% classification accuracy. This approach will help to conduct further research on the recognition of human activities based on their biomedical signals.

[38]  arXiv:2101.06715 [pdf]
Title: Heterogeneous Hand Guise Classification Based on Surface Electromyographic Signals Using Multichannel Convolutional Neural Network
Comments: 10 pages, 2019 22nd International Conference of Computer and Information Technology (ICCIT), 18-20 December, 2019
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Electromyography (EMG) is a way of measuring the bioelectric activities that take place inside the muscles. EMG is usually performed to detect abnormalities within the nerves or muscles of a target area. The recent developments in the field of Machine Learning allow us to use EMG signals to teach machines the complex properties of human movements. Modern machines are capable of detecting numerous human activities and distinguishing among them solely based on the EMG signals produced by those activities. However, success in accomplishing this task mostly depends on the learning technique used by the machine to analyze EMG signals; and even the latest algorithms do not result in flawless classification. In this study, a novel classification method has been described employing a multichannel Convolutional Neural Network (CNN) that interprets surface EMG signals by the properties they exhibit in the power domain. The proposed method was tested on a well-established EMG dataset, and the result yields very high classification accuracy. This learning model will help researchers to develop prosthetic arms capable of detecting various hand gestures to mimic them afterwards.

[39]  arXiv:2101.06720 [pdf, other]
Title: Deep Multi-Task Learning for Joint Localization, Perception, and Prediction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Over the last few years, we have witnessed tremendous progress on many subtasks of autonomous driving, including perception, motion forecasting, and motion planning. However, these systems often assume that the car is accurately localized against a high-definition map. In this paper we question this assumption, and investigate the issues that arise in state-of-the-art autonomy stacks under localization error. Based on our observations, we design a system that jointly performs perception, prediction, and localization. Our architecture is able to reuse computation between both tasks, and is thus able to correct localization errors efficiently. We show experiments on a large-scale autonomy dataset, demonstrating the efficiency and accuracy of our proposed approach.

[40]  arXiv:2101.06742 [pdf, other]
Title: Deep Parametric Continuous Convolutional Neural Networks
Comments: Accepted by CVPR 2018
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)

Standard convolutional neural networks assume a grid structured input is available and exploit discrete convolutions as their fundamental building blocks. This limits their applicability to many real-world applications. In this paper we propose Parametric Continuous Convolution, a new learnable operator that operates over non-grid structured data. The key idea is to exploit parameterized kernel functions that span the full continuous vector space. This generalization allows us to learn over arbitrary data structures as long as their support relationship is computable. Our experiments show significant improvement over the state-of-the-art in point cloud segmentation of indoor and outdoor scenes, and lidar motion estimation of driving scenes.
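
A continuous convolution with a parameterized kernel can be sketched as a weighted sum over support points, where the weight for each point comes from a small function of its offset to the query location. The one-hidden-layer ReLU kernel, the unnormalized sum, and the names below are assumptions chosen for illustration, not details taken from the abstract.

```python
def mlp_kernel(offset, w1, b1, w2, b2):
    """Tiny one-hidden-layer ReLU network mapping an offset to a weight."""
    hidden = [
        max(0.0, sum(w * o for w, o in zip(row, offset)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def continuous_conv(points, features, query, params, neighbors):
    """Output at `query`: sum over neighbors j of g(query - x_j) * f_j."""
    total = 0.0
    for j in neighbors:
        offset = [q - p for q, p in zip(query, points[j])]
        total += mlp_kernel(offset, *params) * features[j]
    return total


# Hypothetical parameters: hidden layer is the identity, output sums it.
params = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 1.0], 0.0)
points = [(0.0, 0.0), (1.0, 0.0)]   # irregular (non-grid) support points
features = [2.0, 3.0]               # one scalar feature per point
result = continuous_conv(points, features, (1.0, 1.0), params, [0, 1])
print(result)
```

Since the kernel is evaluated on real-valued offsets, the query location and the support points do not need to lie on a grid.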

[41]  arXiv:2101.06747 [pdf, other]
Title: Intestinal Parasites Classification Using Deep Belief Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Currently, approximately 4 billion people are infected by intestinal parasites worldwide. Diseases caused by such infections constitute a public health problem in most tropical countries, leading to physical and mental disorders, and even death to children and immunodeficient individuals. Although subject to high error rates, human visual inspection is still in charge of the vast majority of clinical diagnoses. In the past years, some works addressed intelligent computer-aided intestinal parasites classification, but they usually suffer from misclassification due to similarities between parasites and fecal impurities. In this paper, we introduce Deep Belief Networks to the context of automatic intestinal parasites classification. Experiments conducted over three datasets composed of eggs, larvae, and protozoa provided promising results, even considering unbalanced classes and also fecal impurities.

[42]  arXiv:2101.06770 [pdf, other]
Title: Improving Apparel Detection with Category Grouping and Multi-grained Branches
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Training an accurate object detector is expensive and time-consuming. One main reason lies in the laborious labeling process, i.e., annotating category and bounding box information for all instances in every image. In this paper, we examine ways to improve performance of deep object detectors without extra labeling. We first explore grouping existing categories of high visual and semantic similarity together as one super category (or, a superclass). Then, we study how this knowledge of hierarchical categories can be exploited to better detect objects using multi-grained RCNN top branches. Experimental results on DeepFashion2 and OpenImagesV4-Clothing reveal that the proposed detection heads with multi-grained branches can boost the overall performance by 2.3 mAP for DeepFashion2 and 2.5 mAP for OpenImagesV4-Clothing with no additional time-consuming annotations. More importantly, classes that have fewer training samples tend to benefit more from the proposed multi-grained heads with superclass grouping. In particular, we improve the mAP for the last 30% of categories (in terms of training sample number) by 2.6 and 4.6 for DeepFashion2 and OpenImagesV4-Clothing, respectively.

[43]  arXiv:2101.06771 [pdf, other]
Title: Temporal Spatial-Adaptive Interpolation with Deformable Refinement for Electron Microscopic Images
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recently, flow-based methods have achieved promising success in video frame interpolation. However, electron microscopic (EM) images suffer from unstable image quality, low PSNR, and disorderly deformation. Existing flow-based interpolation methods cannot precisely compute optical flow for EM images since they predict only a single offset for each position. To overcome these problems, we propose a novel interpolation framework for EM images that progressively synthesizes interpolated features in a coarse-to-fine manner. First, we extract missing intermediate features by the proposed temporal spatial-adaptive (TSA) interpolation module. The TSA interpolation module aggregates temporal contexts and then adaptively samples the spatial-related features with the proposed residual spatial adaptive block. Second, we introduce a stacked deformable refinement block (SDRB) to further enhance the reconstruction quality, which is aware of the matching positions and relevant features from input frames with the feedback mechanism. Experimental results demonstrate the superior performance of our approach compared to previous works, both quantitatively and qualitatively.

[44]  arXiv:2101.06773 [pdf, other]
Title: Generating Attribution Maps with Disentangled Masked Backpropagation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Attribution map visualization has arisen as one of the most effective techniques to understand the underlying inference process of Convolutional Neural Networks. In this task, the goal is to compute a score for each image pixel related to its contribution to the final network output. In this paper, we introduce Disentangled Masked Backpropagation (DMBP), a novel gradient-based method that leverages the piecewise linear nature of ReLU networks to decompose the model function into different linear mappings. This decomposition aims to disentangle the positive, negative and nuisance factors from the attribution maps by learning a set of variables masking the contribution of each filter during back-propagation. A thorough evaluation over standard architectures (ResNet50 and VGG16) and benchmark datasets (PASCAL VOC and ImageNet) demonstrates that DMBP generates more visually interpretable attribution maps than previous approaches. Additionally, we quantitatively show that the maps produced by our method are more consistent with the true contribution of each pixel to the final network output.

[45]  arXiv:2101.06784 [pdf, other]
Title: Exploring Adversarial Robustness of Multi-Sensor Perception Systems in Self Driving
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Modern self-driving perception systems have been shown to improve upon processing complementary inputs such as LiDAR with images. In isolation, 2D images have been found to be extremely vulnerable to adversarial attacks. Yet, there have been limited studies on the adversarial robustness of multi-modal models that fuse LiDAR features with image features. Furthermore, existing works do not consider physically realizable perturbations that are consistent across the input modalities. In this paper, we showcase practical susceptibilities of multi-sensor detection by placing an adversarial object on top of a host vehicle. We focus on physically realizable and input-agnostic attacks as they are feasible to execute in practice, and show that a single universal adversary can hide different host vehicles from state-of-the-art multi-modal detectors. Our experiments demonstrate that successful attacks are primarily caused by easily corrupted image features. Furthermore, we find that in modern sensor fusion methods which project image features into 3D, adversarial attacks can exploit the projection process to generate false positives across distant regions in 3D. Towards more robust multi-modal perception systems, we show that adversarial training with feature denoising can boost robustness to such attacks significantly. However, we find that standard adversarial defenses still struggle to prevent false positives which are also caused by inaccurate associations between 3D LiDAR points and 2D pixels.

[46]  arXiv:2101.06820 [pdf, other]
Title: Chaotic-to-Fine Clustering for Unlabeled Plant Disease Images
Comments: This paper has been submitted to Computer Vision and Image Understanding
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Current annotation for plant disease images depends on manual sorting and handcrafted features by agricultural experts, which is time-consuming and labour-intensive. In this paper, we propose a self-supervised clustering framework for grouping plant disease images based on the vulnerability of Kernel K-means. The main idea is to establish a cross iterative under-clustering algorithm based on Kernel K-means to produce the pseudo-labeled training set and a chaotic cluster to be further classified by a deep learning module. In order to verify the effectiveness of our proposed framework, we conduct extensive experiments on three different plant disease datasets with five plants and 17 plant diseases. The experimental results show that our method outperforms five existing state-of-the-art works on image-based plant disease classification over balanced and unbalanced datasets in terms of different metrics.

[47]  arXiv:2101.06832 [pdf, other]
Title: Deep Structured Reactive Planning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

An intelligent agent operating in the real-world must balance achieving its goal with maintaining the safety and comfort of not only itself, but also other participants within the surrounding scene. This requires jointly reasoning about the behavior of other actors while deciding its own actions as these two processes are inherently intertwined - a vehicle will yield to us if we decide to proceed first at the intersection but will proceed first if we decide to yield. However, this is not captured in most self-driving pipelines, where planning follows prediction. In this paper we propose a novel data-driven, reactive planning objective which allows a self-driving vehicle to jointly reason about its own plans as well as how other actors will react to them. We formulate the problem as an energy-based deep structured model that is learned from observational data and encodes both the planning and prediction problems. Through simulations based on both real-world driving and synthetically generated dense traffic, we demonstrate that our reactive model outperforms a non-reactive variant in successfully completing highly complex maneuvers (lane merges/turns in traffic) faster, without trading off collision rate.

[48]  arXiv:2101.06849 [pdf, other]
Title: CFC-Net: A Critical Feature Capturing Network for Arbitrary-Oriented Object Detection in Remote Sensing Images
Comments: The code and models are available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Object detection in optical remote sensing images is an important and challenging task. In recent years, methods based on convolutional neural networks have made good progress. However, due to the large variation in object scale, aspect ratio, and arbitrary orientation, detection performance is difficult to improve further. In this paper, we discuss the role of discriminative features in object detection, and then propose a Critical Feature Capturing Network (CFC-Net) to improve detection accuracy from three aspects: building powerful feature representation, refining preset anchors, and optimizing label assignment. Specifically, we first decouple the classification and regression features, and then construct robust critical features adapted to the respective tasks through the Polarization Attention Module (PAM). With the extracted discriminative regression features, the Rotation Anchor Refinement Module (R-ARM) performs localization refinement on preset horizontal anchors to obtain superior rotation anchors. Next, the Dynamic Anchor Learning (DAL) strategy is given to adaptively select high-quality anchors based on their ability to capture critical features. The proposed framework creates more powerful semantic representations for objects in remote sensing images and achieves high-performance real-time object detection. Experimental results on three remote sensing datasets including HRSC2016, DOTA, and UCAS-AOD show that our method achieves superior detection performance compared with many state-of-the-art approaches. Code and models are available at https://github.com/ming71/CFC-Net.

[49]  arXiv:2101.06860 [pdf, other]
Title: Secrets of 3D Implicit Object Shape Reconstruction in the Wild
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Reconstructing high-fidelity 3D objects from sparse, partial observations is of crucial importance for various applications in computer vision, robotics, and graphics. While recent neural implicit modeling methods show promising results on synthetic or dense datasets, they perform poorly on real-world data that is sparse and noisy. This paper analyzes the root cause of such deficient performance of a popular neural implicit model. We discover that the limitations are due to highly complicated objectives, lack of regularization, and poor initialization. To overcome these issues, we introduce two simple yet effective modifications: (i) a deep encoder that provides a better and more stable initialization for latent code optimization; and (ii) a deep discriminator that serves as a prior model to boost the fidelity of the shape. We evaluate our approach on two real-world self-driving datasets and show superior performance over state-of-the-art 3D object reconstruction methods.

[50]  arXiv:2101.06865 [pdf, other]
Title: Non-parametric Memory for Spatio-Temporal Segmentation of Construction Zones for Self-Driving
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we introduce a non-parametric memory representation for spatio-temporal segmentation that captures the local space and time around an autonomous vehicle (AV). Our representation has three important properties: (i) it remembers what it has seen in the past, (ii) it reinforces and (iii) forgets its past beliefs based on new evidence. Reinforcing is important as the first time we see an element we might be uncertain, e.g., if the element is heavily occluded or at range. Forgetting is desirable, as otherwise false positives will make the self-driving vehicle behave erratically. Our process is informed by 3D reasoning, as occlusion is key to distinguishing between the desire to forget and to remember. We show how our method can be used as an online component to complement static world representations such as HD maps by detecting and remembering changes that should be superimposed on top of this static view due to such events.

[51]  arXiv:2101.06871 [pdf, other]
Title: CheXtransfer: Performance and Parameter Efficiency of ImageNet Models for Chest X-Ray Interpretation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Deep learning methods for chest X-ray interpretation typically rely on pretrained models developed for ImageNet. This paradigm assumes that better ImageNet architectures perform better on chest X-ray tasks and that ImageNet-pretrained weights provide a performance boost over random initialization. In this work, we compare the transfer performance and parameter efficiency of 16 popular convolutional architectures on a large chest X-ray dataset (CheXpert) to investigate these assumptions. First, we find no relationship between ImageNet performance and CheXpert performance for both models without pretraining and models with pretraining. Second, we find that, for models without pretraining, the choice of model family influences performance more than size within a family for medical imaging tasks. Third, we observe that ImageNet pretraining yields a statistically significant boost in performance across architectures, with a higher boost for smaller architectures. Fourth, we examine whether ImageNet architectures are unnecessarily large for CheXpert by truncating final blocks from pretrained models, and find that we can make models 3.25x more parameter-efficient on average without a statistically significant drop in performance. Our work contributes new experimental evidence about the relation of ImageNet to chest X-ray interpretation performance.

[52]  arXiv:2101.06898 [pdf, other]
Title: What Do Deep Nets Learn? Class-wise Patterns Revealed in the Input Space
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep neural networks (DNNs) have been widely adopted in different applications to achieve state-of-the-art performance. However, they are often applied as a black box with limited understanding of what the model has learned from the data. In this paper, we focus on image classification and propose a method to visualize and understand the class-wise patterns learned by DNNs trained under three different settings including natural, backdoored and adversarial. Different from existing class-wise deep representation visualizations, our method searches for a single predictive pattern in the input (i.e. pixel) space for each class. Based on the proposed method, we show that DNNs trained on natural (clean) data learn abstract shapes along with some texture, and backdoored models learn a small but highly predictive pattern for the backdoor target class. Interestingly, the existence of class-wise predictive patterns in the input space indicates that even DNNs trained on clean data can have backdoors, and the class-wise patterns identified by our method can be readily applied to "backdoor" attack the model. In the adversarial setting, we show that adversarially trained models learn more simplified shape patterns. Our method can serve as a useful tool to better understand DNNs trained on different datasets under different settings.

[53]  arXiv:2101.06915 [pdf, other]
Title: TLU-Net: A Deep Learning Approach for Automatic Steel Surface Defect Detection
Journal-ref: International Conference on Applied Artificial Intelligence (ICAPAI 2021), Halden, Norway, May 19-21, 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Visual steel surface defect detection is an essential step in steel sheet manufacturing. Several machine learning-based automated visual inspection (AVI) methods have been studied in recent years. However, most steel manufacturing industries still use manual visual inspection due to training time and inaccuracies involved with AVI methods. Automatic steel defect detection methods could be useful in less expensive and faster quality control and feedback. But preparing the annotated training data for segmentation and classification could be a costly process. In this work, we propose to use the Transfer Learning-based U-Net (TLU-Net) framework for steel surface defect detection. We use a U-Net architecture as the base and explore two kinds of encoders: ResNet and DenseNet. We compare these nets' performance using random initialization and the pre-trained networks trained using the ImageNet data set. The experiments are performed using Severstal data. The results demonstrate that the transfer learning performs 5% (absolute) better than that of the random initialization in defect classification. We found that the transfer learning performs 26% (relative) better than that of the random initialization in defect segmentation. We also found the gain of transfer learning increases as the training data decreases, and the convergence rate with transfer learning is better than that of the random initialization.

[54]  arXiv:2101.06931 [pdf, other]
Title: Label-Efficient Point Cloud Semantic Segmentation: An Active Learning Approach
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Semantic segmentation of 3D point clouds relies on training deep models with a large amount of labeled data. However, labeling 3D point clouds is expensive, so a smart approach to data annotation, a.k.a. active learning, is essential to label-efficient point cloud segmentation. In this work, we first propose a more realistic annotation counting scheme so that a fair benchmark is possible. To better exploit the labeling budget, we adopt a super-point based active learning strategy where we make use of a manifold defined on the point cloud geometry. We further propose an active learning strategy to encourage shape-level diversity and a local spatial consistency constraint. Experiments on two benchmark datasets demonstrate the efficacy of our proposed active learning strategy for label-efficient semantic segmentation of point clouds. Notably, we achieve significant improvement at all levels of annotation budgets and outperform the state-of-the-art methods under the same level of annotation cost.

[55]  arXiv:2101.06977 [pdf, other]
Title: Semi-Automatic Video Annotation For Object Detection
Comments: Submitted to ICIP 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this study, a semi-automatic video annotation method is proposed, utilizing temporal information to eliminate false positives with a tracking-by-detection approach by employing multiple hypothesis tracking (MHT). The MHT method automatically forms tracklets which are confirmed by human operators to enlarge the training set. A novel incremental learning approach helps to annotate videos in an iterative way. The experiments performed on the AUTH Multidrone Dataset reveal that the annotation workload can be reduced by up to 96% by the proposed approach.

[56]  arXiv:2101.07017 [pdf, other]
Title: Deep Universal Blind Image Denoising
Comments: Presented in ICPR 2020 (Oral)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Image denoising is an essential part of many image processing and computer vision tasks due to inevitable noise corruption during image acquisition. Traditionally, many researchers have investigated image priors for denoising within the Bayesian perspective, based on image properties and statistics. Recently, deep convolutional neural networks (CNNs) have shown great success in image denoising by incorporating large-scale synthetic datasets. However, they both have pros and cons. While the deep CNNs are powerful for removing noise with known statistics, they tend to lack flexibility and practicality for blind and real-world noise. Moreover, they cannot easily employ explicit priors. On the other hand, traditional non-learning methods can involve explicit image priors, but they require considerable computation time and cannot exploit large-scale external datasets. In this paper, we present a CNN-based method that leverages the advantages of both methods based on the Bayesian perspective. Concretely, we divide the blind image denoising problem into sub-problems and conquer each inference problem separately. As the CNN is a powerful tool for inference, our method is rooted in CNNs and proposes a novel network design for efficient inference. With our proposed method, we can successfully remove blind and real-world noise using a universal CNN with a moderate number of parameters.

[57]  arXiv:2101.07034 [pdf, other]
Title: Adaptive Graph Representation Learning and Reasoning for Face Parsing
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Face parsing assigns a pixel-wise label to each facial component and has drawn much attention recently. Previous methods have shown success in face parsing but overlook the correlation among facial components. As a matter of fact, the component-wise relationship is a critical clue in discriminating ambiguous pixels in the facial area. To address this issue, we propose adaptive graph representation learning and reasoning over facial components, aiming to learn representative vertices that describe each component, exploit the component-wise relationship and thereby produce accurate parsing results against ambiguity. In particular, we devise an adaptive and differentiable graph abstraction method to represent the components on a graph via pixel-to-vertex projection under the initial condition of a predicted parsing map, where pixel features within a certain facial region are aggregated onto a vertex. Further, we explicitly incorporate the image edge as a prior in the model, which helps to discriminate edge and non-edge pixels during the projection, thus leading to refined parsing results along the edges. Then, our model learns and reasons over the relations among components by propagating information across vertices on the graph. Finally, the refined vertex features are projected back to pixel grids for the prediction of the final parsing map. To train our model, we propose a discriminative loss to penalize small distances between vertices in the feature space, which leads to distinct vertices with strong semantics. Experimental results show the superior performance of the proposed model on multiple face parsing datasets, along with the validation on the human parsing task to demonstrate the generalizability of our model.

[58]  arXiv:2101.07042 [pdf, other]
Title: CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Zero-shot action recognition is the task of recognizing action classes without visual examples, only with a semantic embedding which relates unseen to seen classes. The problem can be seen as learning a function which generalizes well to instances of unseen classes without losing discrimination between classes. Neural networks can model the complex boundaries between visual classes, which explains their success as supervised models. However, in zero-shot learning, these highly specialized class boundaries may not transfer well from seen to unseen classes. In this paper, we propose a clustering-based model, which considers all training samples at once, instead of optimizing for each instance individually. We optimize the clustering using Reinforcement Learning which we show is critical for our approach to work. We call the proposed method CLASTER and observe that it consistently improves over the state-of-the-art in all standard datasets, UCF101, HMDB51, and Olympic Sports; both in the standard zero-shot evaluation and the generalized zero-shot learning.

[59]  arXiv:2101.07116 [pdf, other]
Title: LNSMM: Eye Gaze Estimation With Local Network Share Multiview Multitask
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Eye gaze estimation has become increasingly significant in computer vision. In this paper, we systematically study mainstream eye gaze estimation methods and propose a novel methodology to estimate eye gaze points and eye gaze directions simultaneously. First, we construct a local sharing network for feature extraction in gaze point and gaze direction estimation, which reduces the network's computational parameters and converges quickly. Second, we propose a Multiview Multitask Learning (MTL) framework: for gaze directions, a coplanar constraint is proposed for the left and right eyes; for gaze points, three-view input data indirectly introduces eye position information, and a cross-view pooling module is designed. We also propose a joint loss that handles both gaze point and gaze direction estimation. Finally, we collect a gaze point dataset with three views to complement the existing public dataset. Experiments show that our method achieves state-of-the-art performance relative to current mainstream methods on the two indicators of gaze points and gaze directions.

[60]  arXiv:2101.07172 [pdf, other]
Title: HarDNet-MSEG: A Simple Encoder-Decoder Polyp Segmentation Neural Network that Achieves over 0.9 Mean Dice and 86 FPS
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We propose a new convolutional neural network called HarDNet-MSEG for polyp segmentation. It achieves SOTA in both accuracy and inference speed on five popular datasets. For Kvasir-SEG, HarDNet-MSEG delivers 0.904 mean Dice running at 86.7 FPS on a GeForce RTX 2080 Ti GPU. It consists of a backbone and a decoder. The backbone is a low memory traffic CNN called HarDNet68, which has been successfully applied to various CV tasks including image classification, object detection, multi-object tracking and semantic segmentation. The decoder part is inspired by the Cascaded Partial Decoder, known for fast and accurate salient object detection. We have evaluated HarDNet-MSEG using those five popular datasets. The code and all experiment details are available on GitHub: https://github.com/james128333/HarDNet-MSEG
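
For reference, the mean Dice figure quoted above is conventionally the per-image Dice coefficient, 2|A & B| / (|A| + |B|), averaged over a test set. The sketch below is a minimal standard-library illustration of that metric only. The helper names `dice_score` and `mean_dice`, the flat-list mask layout and the `eps` smoothing term are assumptions made for illustration and are not taken from the HarDNet-MSEG repository.

```python
def dice_score(pred, target, eps=1e-7):
    """Dice coefficient 2|A & B| / (|A| + |B|) for two flat binary masks.

    Hypothetical helper: masks are equal-length sequences of 0/1 values.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # eps keeps the ratio defined (and equal to 1) when both masks are empty.
    return (2.0 * intersection + eps) / (total + eps)


def mean_dice(preds, targets):
    """Average the per-image Dice score over a collection of mask pairs."""
    scores = [dice_score(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)
```

With this definition, a prediction that covers the ground truth plus one extra pixel, such as `dice_score([1, 1, 0, 0], [1, 0, 0, 0])`, scores about 0.667.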

[61]  arXiv:2101.07209 [pdf, other]
Title: Assisting Barrett's esophagus identification using endoscopic data augmentation based on Generative Adversarial Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Computer-aided approaches for automatic diagnosis emerged in the literature since early detection is intrinsically related to remission probabilities. However, they still suffer from drawbacks because of the lack of available data for machine learning purposes, thus implying reduced recognition rates. This work introduces Generative Adversarial Networks to generate high-quality endoscopic images, thereby identifying Barrett's esophagus and adenocarcinoma more precisely. Further, Convolutional Neural Networks are used for feature extraction and classification purposes. The proposed approach is validated over two datasets of endoscopic images, with the experiments conducted over the full and patch-split images. The application of Deep Convolutional Generative Adversarial Networks for the data augmentation step and LeNet-5 and AlexNet for the classification step allowed us to validate the proposed methodology over an extensive set of datasets (based on original and augmented sets), reaching 90% accuracy for the patch-based approach and 85% for the image-based approach. Both results are based on augmented datasets and are statistically different from the ones obtained in the original datasets of the same kind. Moreover, the impact of data augmentation was evaluated in the context of image description and classification, and the results obtained using synthetic images outperformed the ones over the original datasets, as well as other recent approaches from the literature. Such results suggest promising insights related to the importance of proper data for the accurate classification concerning computer-assisted Barrett's esophagus and adenocarcinoma detection.

[62]  arXiv:2101.07253 [pdf, other]
Title: Cross-modal Learning for Domain Adaptation in 3D Semantic Segmentation
Comments: arXiv admin note: text overlap with arXiv:1911.12676
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Domain adaptation is an important task to enable learning when labels are scarce. While most works focus only on the image modality, there are many important multi-modal datasets. In order to leverage multi-modality for domain adaptation, we propose cross-modal learning, where we enforce consistency between the predictions of two modalities via mutual mimicking. We constrain our network to make correct predictions on labeled data and consistent predictions across modalities on unlabeled target-domain data. Experiments in unsupervised and semi-supervised domain adaptation settings prove the effectiveness of this novel domain adaptation strategy. Specifically, we evaluate on the task of 3D semantic segmentation using the image and point cloud modality. We leverage recent autonomous driving datasets to produce a wide variety of domain adaptation scenarios including changes in scene layout, lighting, sensor setup and weather, as well as the synthetic-to-real setup. Our method significantly improves over previous uni-modal adaptation baselines on all adaptation scenarios. Code will be made available.
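
One way to picture the consistency term described above is a symmetric divergence between the class distributions predicted by the image branch and the point cloud branch for the same 3D point. The sketch below uses KL divergence in both directions. That choice, the function names and the single-point interface are assumptions for illustration, and the abstract does not specify the exact loss or how gradients are routed between branches.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_modal_loss(logits_2d, logits_3d):
    """Symmetric consistency penalty between two modalities (assumed form).

    Each branch is pulled toward the other's prediction. The value is zero
    when both branches output the same distribution.
    """
    p2d, p3d = softmax(logits_2d), softmax(logits_3d)
    return kl_divergence(p3d, p2d) + kl_divergence(p2d, p3d)
```

Because the penalty needs no labels, a term of this kind can be evaluated on unlabeled target-domain points, while an ordinary supervised loss is applied wherever labels exist.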

Cross-lists for Tue, 19 Jan 21

[63]  arXiv:2002.05283 (cross-list from cs.LG) [pdf, other]
Title: Stabilizing Differentiable Architecture Search via Perturbation-based Regularization
Comments: ICML 2020, code is available at this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Differentiable architecture search (DARTS) is a prevailing NAS solution to identify architectures. Based on the continuous relaxation of the architecture space, DARTS learns a differentiable architecture weight and largely reduces the search cost. However, its stability has been challenged for yielding deteriorating architectures as the search proceeds. We find that the precipitous validation loss landscape, which leads to a dramatic performance drop when distilling the final architecture, is an essential factor that causes instability. Based on this observation, we propose a perturbation-based regularization - SmoothDARTS (SDARTS), to smooth the loss landscape and improve the generalizability of DARTS-based methods. In particular, our new formulations stabilize DARTS-based methods by either random smoothing or adversarial attack. The search trajectory on NAS-Bench-1Shot1 demonstrates the effectiveness of our approach and due to the improved stability, we achieve performance gain across various search spaces on 4 datasets. Furthermore, we mathematically show that SDARTS implicitly regularizes the Hessian norm of the validation loss, which accounts for a smoother loss landscape and improved performance.

[64]  arXiv:2101.06329 (cross-list from cs.LG) [pdf, other]
Title: In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning
Comments: accepted in ICLR 2021
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

The recent research in semi-supervised learning (SSL) is mostly dominated by consistency regularization based methods which achieve strong performance. However, they heavily rely on domain-specific data augmentations, which are not easy to generate for all data modalities. Pseudo-labeling (PL) is a general SSL approach that does not have this constraint but performs relatively poorly in its original formulation. We argue that PL underperforms due to the erroneous high confidence predictions from poorly calibrated models; these predictions generate many incorrect pseudo-labels, leading to noisy training. We propose an uncertainty-aware pseudo-label selection (UPS) framework which improves pseudo labeling accuracy by drastically reducing the amount of noise encountered in the training process. Furthermore, UPS generalizes the pseudo-labeling process, allowing for the creation of negative pseudo-labels; these negative pseudo-labels can be used for multi-label classification as well as negative learning to improve the single-label classification. We achieve strong performance when compared to recent SSL methods on the CIFAR-10 and CIFAR-100 datasets. Also, we demonstrate the versatility of our method on the video dataset UCF-101 and the multi-label dataset Pascal VOC.
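
The selection idea described above can be sketched as a pair of thresholds on confidence and uncertainty: a class becomes a positive pseudo-label only when the model is both confident and certain, and a negative pseudo-label when the model is confidently and certainly against it. The code below is a minimal illustration under stated assumptions. Uncertainty is taken to be the standard deviation over several stochastic forward passes, and the threshold values and the name `select_pseudo_labels` are illustrative and not drawn from the abstract.

```python
from statistics import mean, pstdev


def select_pseudo_labels(passes, tau_p=0.7, tau_n=0.05,
                         kappa_p=0.05, kappa_n=0.005):
    """Pick positive and negative pseudo-labels for ONE unlabeled sample.

    passes: list of T stochastic forward passes, each a list of per-class
    probabilities (assumed input format).
    Returns (positive_classes, negative_classes) as lists of class indices.
    """
    num_classes = len(passes[0])
    positives, negatives = [], []
    for c in range(num_classes):
        probs = [p[c] for p in passes]
        confidence = mean(probs)
        uncertainty = pstdev(probs)  # spread across the stochastic passes
        if confidence >= tau_p and uncertainty <= kappa_p:
            positives.append(c)      # confident and certain: class present
        elif confidence <= tau_n and uncertainty <= kappa_n:
            negatives.append(c)      # confident and certain: class absent
    return positives, negatives
```

A sample whose predictions swing widely between passes yields no pseudo-labels at all, even if its average confidence is high, which is how a rule of this shape filters out noisy labels.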

[65]  arXiv:2101.06354 (cross-list from eess.IV) [pdf, other]
Title: A Hitchhiker's Guide to Structural Similarity
Comments: Submitted to IEEE Access on January 8, 2021
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

The Structural Similarity (SSIM) Index is a very widely used image/video quality model that continues to play an important role in the perceptual evaluation of compression algorithms, encoding recipes and numerous other image/video processing algorithms. Several public implementations of the SSIM and Multiscale-SSIM (MS-SSIM) algorithms have been developed, which differ in efficiency and performance. This "bendable ruler" makes the process of quality assessment of encoding algorithms unreliable. To address this situation, we studied and compared the functions and performances of popular and widely used implementations of SSIM, and we also considered a variety of design choices. Based on our studies and experiments, we have arrived at a collection of recommendations on how to use SSIM most effectively, including ways to reduce its computational burden.
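
For readers unfamiliar with the index, the sketch below evaluates the standard SSIM expression, (2 mu_x mu_y + C1)(2 sigma_xy + C2) / ((mu_x^2 + mu_y^2 + C1)(sigma_x^2 + sigma_y^2 + C2)), on a single window of pixel values. The uniform (unweighted) window statistics, the flat-list input and the name `ssim_window` are simplifying assumptions. Windowing, weighting and downsampling are typical of the design choices on which implementations differ.

```python
def ssim_window(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """SSIM of two equal-length pixel windows using uniform weights.

    Simplified sketch: one window, population statistics, no Gaussian
    weighting and no averaging over sliding windows.
    """
    if len(x) != len(y) or not x:
        raise ValueError("windows must be non-empty and of equal length")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants from the usual SSIM definition.
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical windows score 1, and the score drops as luminance, contrast or structure diverge.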

[66]  arXiv:2101.06383 (cross-list from cs.MM) [pdf]
Title: A Novel Local Binary Pattern Based Blind Feature Image Steganography
Journal-ref: Multimedia Tools and Applications, vol-79, no-27-28, pp. 19561-19574, 2020
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)

Steganography methods in general terms tend to embed more and more secret bits in the cover images. Most of these methods are designed to embed secret information in such a way that the change in the visual quality of the resulting stego image is not detectable. There exist some methods which preserve the global structure of the cover after embedding. However, the embedding capacity of these methods is very low. In this paper, a novel feature-based blind image steganography technique is proposed, which preserves the LBP (local binary pattern) feature of the cover with comparable embedding rates. The local binary pattern is a well-known image descriptor used for image representation. The proposed scheme computes the local binary pattern to hide the bits of the secret image in such a way that the local relationships that exist in the cover are preserved in the resulting stego image. The performance of the proposed steganography method has been tested on several images of different types to show its robustness. State-of-the-art LSB-based steganography methods are compared with the proposed method to show the effectiveness of feature-based image steganography.
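
As background on the descriptor itself, the basic 3x3 local binary pattern compares each of a pixel's eight neighbours with the centre value and packs the comparison results into one byte. The sketch below shows only that descriptor, not the embedding scheme of the paper. The clockwise bit ordering starting at the top-left neighbour and the helper names are assumptions, since conventions vary between implementations.

```python
# Neighbour offsets (row, col), clockwise from the top-left pixel.
# The ordering is an assumed convention; other orderings are also in use.
_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, r, c):
    """8-bit LBP code of the interior pixel at (r, c) in a 2D list image."""
    center = image[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(_OFFSETS):
        # A neighbour at least as bright as the centre sets its bit.
        if image[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_image(image):
    """LBP codes for all interior pixels (the one-pixel border is skipped)."""
    rows, cols = len(image), len(image[0])
    return [[lbp_code(image, r, c) for c in range(1, cols - 1)]
            for r in range(1, rows - 1)]
```

Since the code depends only on which neighbours are at least as bright as the centre, pixel values can change without altering it as long as those comparisons hold, which is the kind of local relationship the abstract refers to preserving.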

[67]  arXiv:2101.06395 (cross-list from cs.LG) [pdf, other]
Title: Free Lunch for Few-shot Learning: Distribution Calibration
Authors: Shuo Yang, Lu Liu, Min Xu
Comments: ICLR 2021 oral paper, code is available at this https URL
Journal-ref: The 9th International Conference on Learning Representations (ICLR 2021)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Learning from a limited number of samples is challenging since the learned model can easily become overfitted based on the biased distribution formed by only a few training examples. In this paper, we calibrate the distribution of these few-sample classes by transferring statistics from the classes with sufficient examples, then an adequate number of examples can be sampled from the calibrated distribution to expand the inputs to the classifier. We assume every dimension in the feature representation follows a Gaussian distribution so that the mean and the variance of the distribution can borrow from that of similar classes whose statistics are better estimated with an adequate number of samples. Our method can be built on top of off-the-shelf pretrained feature extractors and classification models without extra parameters. We show that a simple logistic regression classifier trained using the features sampled from our calibrated distribution can outperform the state-of-the-art accuracy on two datasets (~5% improvement on miniImageNet compared to the next best). The visualization of these generated features demonstrates that our calibrated distribution is an accurate estimation.
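
The calibration step described above can be sketched as follows: find the base classes whose feature means lie closest to a few-shot sample, borrow their per-dimension mean and variance, and draw extra features from the resulting Gaussian. This is a rough illustration under stated assumptions. The Euclidean nearest-class rule, the parameters `k` and `alpha`, the averaging formula and the function names are illustrative and are not taken from the abstract.

```python
import random


def calibrate(support, base_means, base_vars, k=2, alpha=0.0):
    """Estimate a per-dimension Gaussian for a few-shot class.

    support: feature vector of one few-shot sample.
    base_means / base_vars: per-class feature means and variances of the
    well-sampled base classes (lists of equal-length vectors).
    k and alpha are hypothetical knobs: number of borrowed classes and an
    extra variance term.
    """
    dists = [sum((s - m) ** 2 for s, m in zip(support, m_vec))
             for m_vec in base_means]
    nearest = sorted(range(len(base_means)), key=lambda i: dists[i])[:k]
    dim = len(support)
    mean = [(support[d] + sum(base_means[i][d] for i in nearest))
            / (len(nearest) + 1) for d in range(dim)]
    var = [sum(base_vars[i][d] for i in nearest) / len(nearest) + alpha
           for d in range(dim)]
    return mean, var


def sample_features(mean, var, n, rng):
    """Draw n feature vectors, treating every dimension as independent."""
    return [[rng.gauss(m, v ** 0.5) for m, v in zip(mean, var)]
            for _ in range(n)]
```

The sampled vectors would then be added to the few-shot support set before fitting a simple classifier such as logistic regression.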

[68]  arXiv:2101.06414 (cross-list from cs.RO) [pdf, other]
Title: Towards Deep Learning Assisted Autonomous UAVs for Manipulation Tasks in GPS-Denied Environments
Comments: 8 pages, 5 figures, 5 tables, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

In this work, we present a pragmatic approach to enable unmanned aerial vehicles (UAVs) to autonomously perform highly complicated tasks of object pick and place. This paper is largely inspired by challenge-2 of MBZIRC 2020 and is primarily focused on the task of assembling large 3D structures in outdoor, GPS-denied environments. Primary contributions of this system are: (i) a novel computationally efficient deep learning based unified multi-task visual perception system for target localization, part segmentation, and tracking, (ii) a novel deep learning based grasp state estimation, (iii) a retracting electromagnetic gripper design, (iv) a remote computing approach which exploits state-of-the-art MIMO based high speed (5000Mb/s) wireless links to allow the UAVs to execute compute intensive tasks on remote high end compute servers, and (v) system integration in which several system components are woven together in order to develop an optimized software stack. We use DJI Matrice-600 Pro, a hex-rotor UAV and interface it with the custom designed gripper. Our framework is deployed on the specified UAV in order to report the performance analysis of the individual modules. Apart from the manipulation system, we also highlight several hidden challenges associated with the UAVs in this context.

[69]  arXiv:2101.06425 (cross-list from eess.IV) [pdf, other]
Title: Morphological Change Forecasting for Prostate Glands using Feature-based Registration and Kernel Density Extrapolation
Comments: Accepted by ISBI 2021
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Organ morphology is a key indicator for prostate disease diagnosis and prognosis. For instance, in longitudinal studies of prostate cancer patients under active surveillance, the volume, boundary smoothness and their changes are closely monitored on time-series MR image data. In this paper, we describe a new framework for forecasting prostate morphological changes, as the ability to detect such changes earlier than what is currently possible may enable timely treatment or avoid unnecessary confirmatory biopsies. In this work, an efficient feature-based MR image registration is first developed to align delineated prostate gland capsules to quantify the morphological changes using the inferred dense displacement fields (DDFs). We then propose to use kernel density estimation (KDE) of the probability density of the DDF-represented future morphology changes, between current and future time points, before the future data become available. The KDE utilises a novel distance function that takes into account morphology, stage-of-progression and duration-of-change, which are considered factors in such subject-specific forecasting. We validate the proposed approach on image masks unseen to registration network training, without using any data acquired at the future target time points. Experimental results are presented on a longitudinal data set with 331 images from 73 patients, yielding an average Dice score of 0.865 on a holdout set, between the ground-truth and the image masks warped by the KDE-predicted-DDFs.
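For readers unfamiliar with the Dice score reported above, the following minimal sketch computes it for two binary masks given as flat lists of 0/1 values. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the paper.

```python
def dice_score(mask_a, mask_b):
    """Dice overlap 2|A and B| / (|A| + |B|) for two equal-length binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_sum = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if size_sum == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / size_sum


ground_truth = [1, 1, 0, 0]
warped = [1, 0, 0, 0]
score = dice_score(ground_truth, warped)
```

A score of 1.0 means the two masks coincide exactly and 0.0 means they do not overlap at all.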

[70]  arXiv:2101.06440 (cross-list from eess.IV) [pdf, other]
Title: Scale factor point spread function matching: Beyond aliasing in image resampling
Comments: Published in MICCAI 2015
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Imaging devices exploit the Nyquist-Shannon sampling theorem to avoid both aliasing and redundant oversampling by design. Conversely, in medical image resampling, images are considered as continuous functions, are warped by a spatial transformation, and are then sampled on a regular grid. In most cases, the spatial warping changes the frequency characteristics of the continuous function and no special care is taken to ensure that the resampling grid respects the conditions of the sampling theorem. This paper shows that this oversight introduces artefacts, including aliasing, that can lead to important bias in clinical applications. One notable exception to this common practice is when multi-resolution pyramids are constructed, with low-pass "anti-aliasing" filters being applied prior to downsampling. In this work, we illustrate why similar caution is needed when resampling images under general spatial transformations and propose a novel method that is more respectful of the sampling theorem, minimising aliasing and loss of information. We introduce the notion of scale factor point spread function (sfPSF) and employ Gaussian kernels to achieve a computationally tractable resampling scheme that can cope with arbitrary non-linear spatial transformations and grid sizes. Experiments demonstrate significant (p<1e-4) technical and clinical implications of the proposed method.

[71]  arXiv:2101.06459 (cross-list from cs.LG) [pdf, other]
Title: Robustness to Augmentations as a Generalization metric
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Generalization is the ability of a model to predict on unseen domains and is a fundamental task in machine learning. Several generalization bounds, both theoretical and empirical, have been proposed, but they do not provide tight bounds. In this work, we propose a simple yet effective method to predict the generalization performance of a model by using the concept that models that are robust to augmentations are more generalizable than those which are not. We experiment with several augmentations and compositions of augmentations to check the generalization capacity of a model. We also provide a detailed motivation behind the proposed method. The proposed generalization metric is calculated based on the change in the output of the model after augmenting the input. The proposed method was the first runner-up solution for the NeurIPS competition on Predicting Generalization in Deep Learning.
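One simple way to measure "change in the output after augmenting the input" is the fraction of predictions that stay the same under augmentation. The sketch below shows that idea only. The abstract does not specify the exact metric, so the scoring rule and the names `augmentation_consistency` and `argmax` are assumptions, and the toy model stands in for a trained network.

```python
def argmax(scores):
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def augmentation_consistency(model, inputs, augmentations):
    """Fraction of (input, augmentation) pairs whose predicted label is unchanged.

    model maps an input to a list of class scores; each augmentation maps
    an input to a transformed input of the same kind.
    """
    kept = 0
    total = 0
    for x in inputs:
        base_label = argmax(model(x))
        for augment in augmentations:
            if argmax(model(augment(x))) == base_label:
                kept += 1
            total += 1
    return kept / total if total else 0.0


# Toy stand-ins: a two-class "model" and two augmentations.
def toy_model(x):
    return [sum(x), -sum(x)]


def flip(x):
    return list(reversed(x))


def negate(x):
    return [-v for v in x]


score = augmentation_consistency(toy_model, [[1, 2], [3, -1]], [flip, negate])
```

Under this assumed rule, a higher score indicates a model whose predictions are more stable under the chosen augmentations.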

[72]  arXiv:2101.06468 (cross-list from eess.IV) [pdf, other]
Title: Adversarial cycle-consistent synthesis of cerebral microbleeds for data augmentation
Comments: Accepted in Medical Imaging meets NIPS Workshop, NIPS 2020
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

We propose a novel framework for controllable pathological image synthesis for data augmentation. Inspired by CycleGAN, we perform cycle-consistent image-to-image translation between two domains: healthy and pathological. Guided by a semantic mask, an adversarially trained generator synthesizes pathology on a healthy image in the specified location. We demonstrate our approach on an institutional dataset of cerebral microbleeds in traumatic brain injury patients. We utilize synthetic images generated with our method for data augmentation in cerebral microbleeds detection. Enriching the training dataset with synthetic images exhibits the potential to increase detection performance for cerebral microbleeds in traumatic brain injury patients.

[73]  arXiv:2101.06474 (cross-list from eess.IV) [pdf, other]
Title: Optimized and autonomous machine learning framework for characterizing pores, particles, grains and grain boundaries in microstructural images
Subjects: Image and Video Processing (eess.IV); Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)

Additively manufactured metals exhibit heterogeneous microstructure which dictates their material and failure properties. Experimental microstructural characterization techniques generate a large amount of data that requires expensive computational resources. In this work, an optimized machine learning (ML) framework is proposed to autonomously and efficiently characterize pores, particles, grains and grain boundaries (GBs) from a given microstructure image. First, using a classifier Convolutional Neural Network (CNN), defects such as pores, powder particles, or GBs were recognized from a given microstructure. Depending on the type of defect, two different processes were used. For powder particles or pores, binary segmentations were generated using an optimized Convolutional Encoder-Decoder Network (CEDN). The binary segmentations were used to obtain particle and pore size and bounding boxes using an object detection ML network (YOLOv5). For GBs, another optimized CEDN was developed to generate RGB segmentation images, which were used to obtain grain size distribution using two regression CNNs. To optimize the RGB CEDN, the Deep Emulator Network SEarch (DENSE) method which employs the Covariance Matrix Adaptation - Evolution Strategy (CMA-ES) was implemented. The optimized RGB segmentation network showed a substantial reduction in training time and GPU usage compared to the unoptimized network, while maintaining high accuracy. Lastly, the proposed framework showed a significant improvement in analysis time when compared to conventional methods.

[74]  arXiv:2101.06480 (cross-list from cs.LG) [pdf, other]
Title: SelfMatch: Combining Contrastive Self-Supervision and Consistency for Semi-Supervised Learning
Comments: 4 pages, NeurIPS 2020 Workshop: Self-Supervised Learning - Theory and Practice
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

This paper introduces SelfMatch, a semi-supervised learning method that combines the power of contrastive self-supervised learning and consistency regularization. SelfMatch consists of two stages: (1) self-supervised pre-training based on contrastive learning and (2) semi-supervised fine-tuning based on augmentation consistency regularization. We empirically demonstrate that SelfMatch achieves state-of-the-art results on standard benchmark datasets such as CIFAR-10 and SVHN. For example, for CIFAR-10 with 40 labeled examples, SelfMatch achieves 93.19% accuracy, outperforming strong previous methods such as MixMatch (52.46%), UDA (70.95%), ReMixMatch (80.9%), and FixMatch (86.19%). We note that SelfMatch can close the gap between supervised learning (95.87%) and semi-supervised learning (93.19%) by using only a few labels for each class.

[75]  arXiv:2101.06507 (cross-list from cs.LG) [pdf, other]
Title: Multi-objective Search of Robust Neural Architectures against Multiple Types of Adversarial Attacks
Authors: Jia Liu, Yaochu Jin
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

Many existing deep learning models are vulnerable to adversarial examples that are imperceptible to humans. To address this issue, various methods have been proposed to design network architectures that are robust to one particular type of adversarial attack. It is practically impossible, however, to predict beforehand which type of attack a machine learning model may suffer from. To address this challenge, we propose to search for deep neural architectures that are robust to five types of well-known adversarial attacks using a multi-objective evolutionary algorithm. To reduce the computational cost, a normalized error rate of a randomly chosen attack is calculated as the robustness for each newly generated neural architecture at each generation. All non-dominated network architectures obtained by the proposed method are then fully trained against randomly chosen adversarial attacks and tested on two widely used datasets. Our experimental results demonstrate the superiority of optimized neural architectures found by the proposed approach over state-of-the-art networks that are widely used in the literature in terms of the classification accuracy under different adversarial attacks.

[76]  arXiv:2101.06547 (cross-list from cs.RO) [pdf, other]
Title: LookOut: Diverse Multi-Future Prediction and Planning for Self-Driving
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Self-driving vehicles need to anticipate a diverse set of future traffic scenarios in order to safely share the road with other traffic participants that may exhibit rare but dangerous driving. In this paper, we present LookOut, an approach to jointly perceive the environment and predict a diverse set of futures from sensor data, estimate their probability, and optimize a contingency plan over these diverse future realizations. In particular, we learn a diverse joint distribution over multi-agent future trajectories in a traffic scene that allows us to cover a wide range of future modes with high sample efficiency while leveraging the expressive power of generative models. Unlike previous work in diverse motion forecasting, our diversity objective explicitly rewards sampling future scenarios that require distinct reactions from the self-driving vehicle for improved safety. Our contingency planner then finds comfortable trajectories that ensure safe reactions to a wide range of future scenarios. Through extensive evaluations, we show that our model demonstrates significantly more diverse and sample-efficient motion forecasting in a large-scale self-driving dataset as well as safer and more comfortable motion plans in long-term closed-loop simulations than current state-of-the-art models.

[77]  arXiv:2101.06549 (cross-list from cs.RO) [pdf, other]
Title: AdvSim: Generating Safety-Critical Scenarios for Self-Driving Vehicles
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

As self-driving systems become better, simulating scenarios where the autonomy stack is likely to fail becomes of key importance. Traditionally, those scenarios are generated for a few scenes with respect to the planning module that takes ground-truth actor states as input. This does not scale and cannot identify all possible autonomy failures, such as perception failures due to occlusion. In this paper, we propose AdvSim, an adversarial framework to generate safety-critical scenarios for any LiDAR-based autonomy system. Given an initial traffic scenario, AdvSim modifies the actors' trajectories in a physically plausible manner and updates the LiDAR sensor data to create realistic observations of the perturbed world. Importantly, by simulating directly from sensor data, we obtain adversarial scenarios that are safety-critical for the full autonomy stack. Our experiments show that our approach is general and can identify thousands of semantically meaningful safety-critical scenarios for a wide range of modern self-driving systems. Furthermore, we show that the robustness and safety of these autonomy systems can be further improved by training them with scenarios generated by AdvSim.

[78]  arXiv:2101.06557 (cross-list from cs.RO) [pdf, other]
Title: TrafficSim: Learning to Simulate Realistic Multi-Agent Behaviors
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Simulation has the potential to massively scale evaluation of self-driving systems, enabling rapid development as well as safe deployment. To close the gap between simulation and the real world, we need to simulate realistic multi-agent behaviors. Existing simulation environments rely on heuristic-based models that directly encode traffic rules, which cannot capture irregular maneuvers (e.g., nudging, U-turns) and complex interactions (e.g., yielding, merging). In contrast, we leverage real-world data to learn directly from human demonstration and thus capture a more diverse set of actor behaviors. To this end, we propose TrafficSim, a multi-agent behavior model for realistic traffic simulation. In particular, we leverage an implicit latent variable model to parameterize a joint actor policy that generates socially-consistent plans for all actors in the scene jointly. To learn a robust policy amenable to long horizon simulation, we unroll the policy in training and optimize through the fully differentiable simulation across time. Our learning objective incorporates both human demonstrations and common sense. We show TrafficSim generates significantly more realistic and diverse traffic scenarios as compared to a diverse set of baselines. Notably, we can exploit trajectories generated by TrafficSim as effective data augmentation for training a better motion planner.

[79]  arXiv:2101.06560 (cross-list from cs.LG) [pdf, other]
Title: Adversarial Attacks On Multi-Agent Communication
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Growing at a very fast pace, modern autonomous systems will soon be deployed at scale, opening up the possibility for cooperative multi-agent systems. By sharing information and distributing workloads, autonomous agents can better perform their tasks and enjoy improved computation efficiency. However, such advantages rely heavily on communication channels which have been shown to be vulnerable to security breaches. Thus, communication can be compromised to execute adversarial attacks on deep learning models which are widely employed in modern systems. In this paper, we explore such adversarial attacks in a novel multi-agent setting where agents communicate by sharing learned intermediate representations. We observe that an indistinguishable adversarial message can severely degrade performance, but becomes weaker as the number of benign agents increases. Furthermore, we show that transfer attacks are more difficult in this setting when compared to directly perturbing the inputs, as it is necessary to align the distribution of communication messages with domain adaptation. Finally, we show that low-budget online attacks can be achieved by exploiting the temporal consistency of streaming sensory inputs.

[80]  arXiv:2101.06562 (cross-list from cs.RO) [pdf, other]
Title: Asynchronous Multi-View SLAM
Comments: 23 pages, 23 figures, 13 tables
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Existing multi-camera SLAM systems assume synchronized shutters for all cameras, which is often not the case in practice. In this work, we propose a generalized multi-camera SLAM formulation which accounts for asynchronous sensor observations. Our framework integrates a continuous-time motion model to relate information across asynchronous multi-frames during tracking, local mapping, and loop closing. For evaluation, we collected AMV-Bench, a challenging new SLAM dataset covering 482 km of driving recorded using our asynchronous multi-camera robotic platform. AMV-Bench is over an order of magnitude larger than previous multi-view HD outdoor SLAM datasets, and covers diverse and challenging motions and environments. Our experiments emphasize the necessity of asynchronous sensor modeling, and show that the use of multiple cameras is critical towards robust and accurate SLAM in challenging outdoor scenes.

[81]  arXiv:2101.06590 (cross-list from cs.LG) [pdf, other]
Title: Cost-Efficient Online Hyperparameter Optimization
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Recent work on hyperparameter optimization (HPO) has shown the possibility of training certain hyperparameters together with regular parameters. However, these online HPO algorithms still require running evaluation on a set of validation examples at each training step, steeply increasing the training cost. To decide when to query the validation loss, we model online HPO as a time-varying Bayesian optimization problem, on top of which we propose a novel "costly feedback" setting to capture the concept of the query cost. Under this setting, standard algorithms are cost-inefficient as they evaluate on the validation set at every round. In contrast, the cost-efficient GP-UCB algorithm proposed in this paper queries the unknown function only when the model is less confident about current decisions. We evaluate our proposed algorithm by tuning hyperparameters online for VGG and ResNet on CIFAR-10 and ImageNet100. Our proposed online HPO algorithm reaches human expert-level performance within a single run of the experiment, while incurring only modest computational overhead compared to regular training.

[82]  arXiv:2101.06639 (cross-list from cs.LG) [pdf, other]
Title: Removing Undesirable Feature Contributions Using Out-of-Distribution Data
Comments: Published as a conference paper at ICLR 2021
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Several data augmentation methods deploy unlabeled-in-distribution (UID) data to bridge the gap between the training and inference of neural networks. However, these methods have clear limitations in terms of availability of UID data and dependence of algorithms on pseudo-labels. Herein, we propose a data augmentation method to improve generalization in both adversarial and standard learning by using out-of-distribution (OOD) data that are devoid of the abovementioned issues. We show how to improve generalization theoretically using OOD data in each learning scenario and complement our theoretical analysis with experiments on CIFAR-10, CIFAR-100, and a subset of ImageNet. The results indicate that undesirable features are shared even among image data that seem to have little correlation from a human point of view. We also present the advantages of the proposed method through comparison with other data augmentation methods, which can be used in the absence of UID data. Furthermore, we demonstrate that the proposed method can further improve the existing state-of-the-art adversarial training.

[83]  arXiv:2101.06704 (cross-list from cs.AI) [pdf, other]
Title: Adversarial Interaction Attack: Fooling AI to Misinterpret Human Intentions
Comments: Preprint
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Understanding the actions of both humans and artificial intelligence (AI) agents is important before modern AI systems can be fully integrated into our daily life. In this paper, we show that, despite their current huge success, deep learning based AI systems can be easily fooled by subtle adversarial noise to misinterpret the intention of an action in interaction scenarios. Based on a case study of skeleton-based human interactions, we propose a novel adversarial attack on interactions, and demonstrate how DNN-based interaction models can be tricked to predict the participants' reactions in unexpected ways. From a broader perspective, the scope of our proposed attack method is not confined to problems related to skeleton data but can also be extended to any type of problem involving sequential regressions. Our study highlights potential risks in the interaction loop with AI and humans, which need to be carefully addressed when deploying AI systems in safety-critical applications.

[84]  arXiv:2101.06772 (cross-list from eess.IV) [pdf, other]
Title: Latent Space Analysis of VAE and Intro-VAE applied to 3-dimensional MR Brain Volumes of Multiple Sclerosis, Leukoencephalopathy, and Healthy Patients
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Multiple Sclerosis (MS) and microvascular leukoencephalopathy are two distinct neurological conditions, the first caused by focal autoimmune inflammation in the central nervous system, the second caused by chronic white matter damage from atherosclerotic microvascular disease. Both conditions lead to signal anomalies on Fluid Attenuated Inversion Recovery (FLAIR) magnetic resonance (MR) images, which can be distinguished by an expert neuroradiologist, but which can look very similar to the untrained eye as well as in the early stage of both diseases. In this paper, we attempt to train a 3-dimensional deep neural network to learn the specific features of both diseases in an unsupervised manner. To this end, in a first step we train a generative neural network to create artificial MR images of both conditions with approximate explicit density, using a mixed dataset of multiple sclerosis, leukoencephalopathy and healthy patients containing in total 5404 volumes of 3096 patients. In a second step, we distinguish features between the different diseases in the latent space of this network, and use them to classify new data.

[85]  arXiv:2101.06775 (cross-list from eess.IV) [pdf, other]
Title: Symmetric-Constrained Irregular Structure Inpainting for Brain MRI Registration with Tumor Pathology
Comments: Published at MICCAI Brainles 2020
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Deformable registration of magnetic resonance images between patients with brain tumors and healthy subjects has been an important tool to specify tumor geometry through location alignment and facilitate pathological analysis. Since the tumor region does not match with any ordinary brain tissue, it has been difficult to deformably register a patient's brain to a normal one. Many patient images are associated with irregularly distributed lesions, resulting in further distortion of normal tissue structures and complicating registration's similarity measure. In this work, we follow a multi-step context-aware image inpainting framework to generate synthetic tissue intensities in the tumor region. The coarse image-to-image translation is applied to make a rough inference of the missing parts. Then, a feature-level patch-match refinement module is applied to refine the details by modeling the semantic relevance between patch-wise features. A symmetry constraint reflecting a large degree of anatomical symmetry in the brain is further proposed to achieve better structure understanding. Deformable registration is applied between inpainted patient images and normal brains, and the resulting deformation field is eventually used to deform original patient data for the final alignment. The method was applied to the Multimodal Brain Tumor Segmentation (BraTS) 2018 challenge database and compared against three existing inpainting methods. The proposed method yielded results with increased peak signal-to-noise ratio, structural similarity index, inception score, and reduced L1 error, leading to successful patient-to-normal brain image registration.

[86]  arXiv:2101.06806 (cross-list from cs.RO) [pdf, other]
Title: MP3: A Unified Model to Map, Perceive, Predict and Plan
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

High-definition maps (HD maps) are a key component of most modern self-driving systems due to their valuable semantic and geometric information. Unfortunately, building HD maps has proven hard to scale due to their cost as well as the requirements they impose on the localization system that has to work everywhere with centimeter-level accuracy. Being able to drive without an HD map would be very beneficial to scale self-driving solutions as well as to increase the failure tolerance of existing ones (e.g., if localization fails or the map is not up-to-date). Towards this goal, we propose MP3, an end-to-end approach to mapless driving where the input is raw sensor data and a high-level command (e.g., turn left at the intersection). MP3 predicts intermediate representations in the form of an online map and the current and future state of dynamic agents, and exploits them in a novel neural motion planner to make interpretable decisions taking into account uncertainty. We show that our approach is significantly safer, more comfortable, and can follow commands better than the baselines in challenging long-term closed-loop simulations, as well as when compared to an expert driver in a large-scale real-world dataset.

[87]  arXiv:2101.06848 (cross-list from cs.AI) [pdf, other]
Title: Faster Convergence in Deep-Predictive-Coding Networks to Learn Deeper Representations
Comments: Submitted to IEEE TNNLS
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

Deep-predictive-coding networks (DPCNs) are hierarchical, generative models that rely on feed-forward and feed-back connections to modulate latent feature representations of stimuli in a dynamic and context-sensitive manner. A crucial element of DPCNs is a forward-backward inference procedure to uncover sparse states of a dynamic model, which are used for invariant feature extraction. However, this inference and the corresponding backwards network parameter updating are major computational bottlenecks. They severely limit the network depths that can be reasonably implemented and easily trained. We therefore propose an optimization strategy, with better empirical and theoretical convergence, based on accelerated proximal gradients.
We demonstrate that the ability to construct deeper DPCNs leads to receptive fields that capture well the entire notions of objects on which the networks are trained. This improves the feature representations. It yields completely unsupervised classifiers that surpass convolutional and convolutional-recurrent autoencoders and are on par with convolutional networks trained in a supervised manner. This is despite the DPCNs having orders of magnitude fewer parameters.

[88]  arXiv:2101.06853 (cross-list from eess.IV) [pdf, other]
Title: Deep Symmetric Adaptation Network for Cross-modality Medical Image Segmentation
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Unsupervised domain adaptation (UDA) methods have shown promising performance in cross-modality medical image segmentation tasks. These typical methods usually utilize a translation network to transform images from the source domain to the target domain or train the pixel-level classifier merely using translated source images and original target images. However, when there exists a large domain shift between source and target domains, we argue that this asymmetric structure cannot fully eliminate the domain gap. In this paper, we present a novel deep symmetric architecture of UDA for medical image segmentation, which consists of a segmentation sub-network and two symmetric source and target domain translation sub-networks. To be specific, based on two translation sub-networks, we introduce a bidirectional alignment scheme via a shared encoder and private decoders to simultaneously align features 1) from source to target domain and 2) from target to source domain, which helps effectively mitigate the discrepancy between domains. Furthermore, for the segmentation sub-network, we train a pixel-level classifier using not only original target images and translated source images, but also original source images and translated target images, which helps sufficiently leverage the semantic information from the images with different styles. Extensive experiments demonstrate that our method has remarkable advantages compared to the state-of-the-art methods in both cross-modality Cardiac and BraTS segmentation tasks.

[89]  arXiv:2101.06894 (cross-list from cs.RO) [pdf, other]
Title: Kimera: from SLAM to Spatial Perception with 3D Dynamic Scene Graphs
Comments: 34 pages, 25 figures, 9 tables. arXiv admin note: text overlap with arXiv:2002.06289
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Humans are able to form a complex mental model of the environment they move in. This mental model captures geometric and semantic aspects of the scene, describes the environment at multiple levels of abstraction (e.g., objects, rooms, buildings), and includes static and dynamic entities and their relations (e.g., a person is in a room at a given time). In contrast, current robots' internal representations still provide a partial and fragmented understanding of the environment, either in the form of a sparse or dense set of geometric primitives (e.g., points, lines, planes, voxels) or as a collection of objects. This paper attempts to reduce the gap between robot and human perception by introducing a novel representation, a 3D Dynamic Scene Graph (DSG), that seamlessly captures metric and semantic aspects of a dynamic environment. A DSG is a layered graph where nodes represent spatial concepts at different levels of abstraction, and edges represent spatio-temporal relations among nodes. Our second contribution is Kimera, the first fully automatic method to build a DSG from visual-inertial data. Kimera includes state-of-the-art techniques for visual-inertial SLAM, metric-semantic 3D reconstruction, object localization, human pose and shape estimation, and scene parsing. Our third contribution is a comprehensive evaluation of Kimera in real-life datasets and photo-realistic simulations, including a newly released dataset, uHumans2, which simulates a collection of crowded indoor and outdoor scenes. Our evaluation shows that Kimera achieves state-of-the-art performance in visual-inertial SLAM, estimates an accurate 3D metric-semantic mesh model in real-time, and builds a DSG of a complex indoor environment with tens of objects and humans in minutes. Our final contribution shows how to use a DSG for real-time hierarchical semantic path-planning. The core modules in Kimera are open-source.

[90]  arXiv:2101.06910 (cross-list from eess.IV) [pdf]
Title: A Novel Registration & Colorization Technique for Thermal to Cross Domain Colorized Images
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Thermal images can be obtained as either grayscale images or pseudo-colored images based on the thermal profile of the object being captured. We present a novel registration method that works on images captured via multiple thermal imagers irrespective of make and internal resolution, as well as a colorization scheme that can be used to obtain a colorized thermal image which is similar to an optical image while retaining the information of the thermal profile as a part of the output, thus providing information of both domains jointly. We call this a cross domain colorized image. We also outline a new public thermal-optical paired database that we are presenting as a part of this paper, containing unique data points obtained via multiple thermal imagers. Finally, we compare the results with prior literature, show how our results are different, and discuss some future work that can be explored further in this domain.

[91]  arXiv:2101.06958 (cross-list from eess.IV) [pdf]
Title: Covid-19 classification with deep neural network and belief functions
Comments: medical image, Covid-19, belief function, BIHI conference
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Computed tomography (CT) images provide useful information for radiologists to diagnose Covid-19. However, visual analysis of CT scans is time-consuming. Thus, it is necessary to develop algorithms for automatic Covid-19 detection from CT images. In this paper, we propose a belief function-based convolutional neural network with semi-supervised training to detect Covid-19 cases. Our method first extracts deep features, maps them into belief degree maps and makes the final classification decision. Our results are more reliable and explainable than those of traditional deep learning-based classification models. Experimental results show that our approach is able to achieve good performance with an accuracy of 0.81, an F1 of 0.812 and an AUC of 0.875.

[92]  arXiv:2101.06963 (cross-list from eess.IV) [pdf, other]
Title: Uncertainty-Aware Body Composition Analysis with Deep Regression Ensembles on UK Biobank MRI
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Purpose: To enable fast and automated analysis of body composition from UK Biobank MRI with accurate estimates of individual measurement errors.
Methods: In an ongoing large-scale imaging study the UK Biobank has acquired MRI of over 40,000 men and women aged 44-82. Phenotypes derived from these images, such as body composition, can reveal new links between genetics, cardiovascular disease, and metabolic conditions. In this retrospective study, neural networks were trained to provide six measurements of body composition from UK Biobank neck-to-knee body MRI. A ResNet50 architecture can automatically predict these values by image-based regression, but may also produce erroneous outliers. Predictive uncertainty, which could identify these failure cases, was therefore modeled with a mean-variance loss and ensembling. Its estimates of individual prediction errors were evaluated in cross-validation on over 8,000 subjects, tested on another 1,000 cases, and finally applied for inference.
Results: Relative measurement errors below 5% were achieved on all but one target, with intra-class correlation coefficients (ICC) above 0.97 both in validation and testing. Both mean-variance loss and ensembling yielded improvements and provided uncertainty estimates that highlighted some of the worst outlier predictions. Combined, they reached the highest quality, but also exhibited a consistent bias towards high uncertainty in heavyweight subjects.
Conclusion: Mean-variance regression and ensembling provided complementary benefits for automated body composition measurements from UK Biobank MRI, reaching high speed and accuracy. These values were inferred for the entire cohort, with uncertainty estimates that can approximate the measurement errors and identify some of the worst outliers automatically.
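
As a rough illustration of mean-variance regression with ensembling, the sketch below assumes the loss is a heteroscedastic Gaussian negative log-likelihood and that ensemble members are merged as a uniform mixture. Both choices are common conventions and are assumptions here, not details confirmed by the abstract; the function names are hypothetical.

```python
import numpy as np


def gaussian_nll(y_true, mean, log_var):
    """Per-sample Gaussian negative log-likelihood (constant term dropped).

    The network is assumed to predict a mean and a log-variance per target.
    """
    y_true, mean, log_var = map(np.asarray, (y_true, mean, log_var))
    return 0.5 * (log_var + (y_true - mean) ** 2 / np.exp(log_var))


def combine_ensemble(means, variances):
    """Merge member predictions as a uniform mixture of Gaussians.

    means, variances: arrays of shape (n_members, n_samples).
    Returns the mixture mean and mixture variance per sample.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mean = means.mean(axis=0)
    # Total variance = average predicted variance + spread of member means.
    variance = (variances + means ** 2).mean(axis=0) - mean ** 2
    return mean, variance
```

Under these assumptions, a large mixture variance flags a prediction as a candidate outlier, either because individual members are uncertain or because they disagree with each other.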

[93]  arXiv:2101.06969 (cross-list from cs.CL) [pdf, other]
Title: Red Alarm for Pre-trained Models: Universal Vulnerabilities by Neuron-Level Backdoor Attacks
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Due to the success of pre-trained models (PTMs), people usually fine-tune an existing PTM for downstream tasks. Most PTMs are contributed and maintained by open sources and may suffer from backdoor attacks. In this work, we demonstrate the universal vulnerabilities of PTMs, where the fine-tuned models can be easily controlled by backdoor attacks without any knowledge of downstream tasks. Specifically, the attacker can add a simple pre-training task to restrict the output hidden states of the trigger instances to the pre-defined target embeddings, namely neuron-level backdoor attack (NeuBA). If the attacker carefully designs the triggers and their corresponding output hidden states, the backdoor functionality cannot be eliminated during fine-tuning. In experiments on both natural language processing (NLP) and computer vision (CV) tasks, we show that NeuBA absolutely controls the predictions of the trigger instances while not influencing the model performance on clean data. Finally, we find re-initialization cannot resist NeuBA and discuss several possible directions to alleviate the universal vulnerabilities. Our findings sound a red alarm for the wide use of PTMs. Our source code and data can be accessed at https://github.com/thunlp/NeuBA.

[94]  arXiv:2101.06979 (cross-list from eess.IV) [pdf, other]
Title: Comparing Deep Learning strategies for paired but unregistered multimodal segmentation of the liver in T1 and T2-weighted MRI
Comments: 4 pages, 3 figures and 3 tables. Conference paper
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

We address the problem of multimodal liver segmentation in paired but unregistered T1 and T2-weighted MR images. We compare several strategies described in the literature, with or without multi-task training, with or without pre-registration. We also compare different loss functions (cross-entropy, Dice loss, and three adversarial losses). All methods achieved comparable performances with the exception of a multi-task setting that performs both segmentations at once, which performed poorly.
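
Of the loss functions compared above, the Dice loss is the easiest to show in a few lines. The following is a minimal sketch of one common soft Dice formulation for a binary mask; the smoothing constant `eps` and the function name are assumptions, and the paper's exact variant may differ.

```python
import numpy as np


def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for a binary segmentation mask.

    pred:   predicted foreground probabilities in [0, 1]
    target: ground-truth mask with values 0 or 1
    eps:    small constant that avoids division by zero on empty masks
    """
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    intersection = np.sum(pred * target)
    dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    return 1.0 - dice
```

A perfect prediction gives a loss of 0, and a prediction with no overlap gives a loss close to 1.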

[95]  arXiv:2101.07005 (cross-list from cs.CE) [pdf]
Title: Optical Flow Method for Measuring Deformation of Soil Specimen Subjected to Torsional Shearing
Subjects: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)

In this study, an optical flow method is used for soil deformation measurement in laboratory tests. The main objective was to observe how the deformation distributes along the whole height of a cylindrical soil sample subjected to torsional shearing (TS test). The experiments were conducted on dry non-cohesive soil samples under two different values of isotropic pressure. Samples were loaded with low-amplitude cyclic torque to analyze the deformation within the small strain range (0.001-0.01%). The optical flow method variant developed by Ce Liu (2009) was used for motion estimation from time-ordered series of images. This algorithm uses scale-invariant feature transform (SIFT) for image feature extraction and a coarse-to-fine matching scheme for faster calculations. The results show that while the displacement values change approximately monotonically along the sample's height, the displacement field is very different for samples under different isotropic pressure. Moreover, the deviations from assumed linearity distribute differently during different stages of the same TS test.

[96]  arXiv:2101.07036 (cross-list from eess.IV) [pdf, other]
Title: Iterative Facial Image Inpainting using Cyclic Reverse Generator
Comments: This paper is under consideration at Neural Computing and Applications Journal
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Facial image inpainting is a challenging problem as it requires generating new pixels that include semantic information for masked key components in a face, e.g., eyes and nose. Recently, remarkable methods have been proposed in this field. Most of these approaches use encoder-decoder architectures and have different limitations such as allowing unique results for a given image and a particular mask. Alternatively, some approaches generate promising results using different masks with generator networks. However, these approaches are optimization-based and usually require quite a number of iterations. In this paper, we propose an efficient solution to the facial image inpainting problem using the Cyclic Reverse Generator (CRG) architecture, which provides an encoder-generator model. We use the encoder to embed a given image to the generator space and incrementally inpaint the masked regions until a plausible image is generated; a discriminator network is utilized to assess the generated images during the iterations. We empirically observed that only a few iterations are sufficient to generate realistic images with the proposed model. After the generation process, for post-processing, we utilize a Unet model that we trained specifically for this task to remedy the artifacts close to the mask boundaries. Our method allows applying sketch-based inpaintings, using a variety of mask types, and producing multiple and diverse results. We qualitatively compared our method with the state-of-the-art models and observed that our method can compete with the other models in all mask types; it is particularly better in images where larger masks are utilized.

[97]  arXiv:2101.07195 (cross-list from eess.IV) [pdf]
Title: A New Approach for Automatic Segmentation and Evaluation of Pigmentation Lesion by using Active Contour Model and Speeded Up Robust Features
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Digital image processing techniques have wide applications in different scientific fields, including medicine. By using image processing algorithms, physicians have been more successful in diagnosing different diseases and have achieved much better treatment results. In this paper, we propose an automatic method for segmenting skin lesions and extracting features that are associated with them. To this end, a combination of Speeded-Up Robust Features (SURF) and Active Contour Model (ACM) is used. In the suggested method, the region of the skin lesion is first segmented from the whole skin image, and then some features like the mean, variance, RGB and HSV parameters are extracted from the segmented region. Comparing the segmentation results of our proposed method with those of Otsu thresholding shows the superiority of our procedure over the Otsu thresholding method. The segmentations of the skin lesion produced by the proposed method and by Otsu thresholding were compared with the physician's manual method. The proposed method for skin lesion segmentation, which is a combination of SURF and ACM, gives the best result. For empirical evaluation of our method, we have applied it to twenty different skin lesion images. The obtained results confirm the high performance, speed and accuracy of our method.
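
For context on the baseline mentioned above, here is a minimal NumPy sketch of standard Otsu thresholding, which picks the gray level that maximizes between-class variance. It assumes an 8-bit grayscale input and is not the authors' code; the function name is hypothetical.

```python
import numpy as np


def otsu_threshold(image):
    """Return the Otsu threshold t for an 8-bit grayscale image.

    Pixels with value > t form one class, pixels <= t the other.
    """
    pixels = np.asarray(image, dtype=np.uint8).ravel()
    hist = np.bincount(pixels, minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)

    omega = np.cumsum(prob)           # probability of the class "<= t"
    mu = np.cumsum(prob * levels)     # cumulative first moment up to t
    mu_total = mu[-1]

    # Between-class variance for every candidate threshold t.
    numerator = (mu_total * omega - mu) ** 2
    denominator = omega * (1.0 - omega)
    between = np.zeros(256)
    valid = denominator > 1e-12       # skip thresholds with an empty class
    between[valid] = numerator[valid] / denominator[valid]
    return int(np.argmax(between))


# Hypothetical two-level image: dark background (10) and bright region (200).
image = np.array([[10, 10, 200, 200],
                  [10, 10, 200, 200]])
threshold = otsu_threshold(image)
mask = image > threshold
```

Because Otsu thresholding uses only the global intensity histogram, it ignores spatial structure, which is one plausible reason a contour-based method could outperform it on lesion boundaries.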

[98]  arXiv:2101.07235 (cross-list from stat.ML) [pdf, other]
Title: Reducing bias and increasing utility by federated generative modeling of medical images using a centralized adversary
Comments: 10 pages, 10 figures
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

We introduce FELICIA (FEderated LearnIng with a CentralIzed Adversary), a generative mechanism enabling collaborative learning. In particular, we show how a data owner with limited and biased data could benefit from other data owners while keeping data from all the sources private. This is a common scenario in medical image analysis where privacy legislation prevents data from being shared outside local premises. FELICIA works for a large family of Generative Adversarial Networks (GAN) architectures including vanilla and conditional GANs as demonstrated in this work. We show that by using the FELICIA mechanism, a data owner with limited image samples can generate high-quality synthetic images with high utility while no data owner has to provide access to its data. The sharing happens solely through a central discriminator that has access limited to synthetic data. Here, utility is defined as classification performance on a real test set. We demonstrate these benefits on several realistic healthcare scenarios using benchmark image datasets (MNIST, CIFAR-10) as well as on medical images for the task of skin lesion classification. With multiple experiments, we show that even in the worst cases, combining FELICIA with real data gracefully achieves performance on par with real data, while most results significantly improve the utility.

[99]  arXiv:2101.07241 (cross-list from cs.RO) [pdf, other]
Title: Learning by Watching: Physical Imitation of Manipulation Skills from Human Videos
Comments: Project Website: this https URL
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

We present an approach for physical imitation from human videos for robot manipulation tasks. The key idea of our method lies in explicitly exploiting the kinematics and motion information embedded in the video to learn structured representations that endow the robot with the ability to imagine how to perform manipulation tasks in its own context. To achieve this, we design a perception module that learns to translate human videos to the robot domain followed by unsupervised keypoint detection. The resulting keypoint-based representations provide semantically meaningful information that can be directly used for reward computing and policy learning. We evaluate the effectiveness of our approach on five robot manipulation tasks, including reaching, pushing, sliding, coffee making, and drawer closing. Detailed experimental evaluations demonstrate that our method performs favorably against previous approaches.

Replacements for Tue, 19 Jan 21

[100]  arXiv:1903.12003 (replaced) [pdf, other]
Title: High Fidelity Face Manipulation with Extreme Poses and Expressions
Comments: Accepted by IEEE Transactions on Information Forensics and Security (TIFS)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[101]  arXiv:1909.01754 (replaced) [pdf, other]
Title: An Efficient and Layout-Independent Automatic License Plate Recognition System Based on the YOLO detector
Comments: Accepted for publication in IET Intelligent Transport Systems
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[102]  arXiv:1912.02801 (replaced) [pdf, other]
Title: PolyTransform: Deep Polygon Transformer for Instance Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[103]  arXiv:1912.09629 (replaced) [pdf, other]
Title: Exploring the Capacity of an Orderless Box Discretization Network for Multi-orientation Scene Text Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[104]  arXiv:2001.11708 (replaced) [pdf, other]
Title: Generalized Visual Information Analysis via Tensorial Algebra
Comments: 42 pages, 17 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Commutative Algebra (math.AC); Rings and Algebras (math.RA)
[105]  arXiv:2002.05046 (replaced) [pdf, other]
Title: Intra-Camera Supervised Person Re-Identification
Comments: Accepted to IJCV
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[106]  arXiv:2002.10905 (replaced) [pdf, other]
Title: Fully Convolutional Neural Networks for Raw Eye Tracking Data Segmentation, Generation, and Reconstruction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Machine Learning (stat.ML)
[107]  arXiv:2002.12041 (replaced) [pdf, other]
Title: Attention-guided Chained Context Aggregation for Semantic Segmentation
Comments: 7 figures, 12 tables, perform minor modifications to the model and add more experimental results compared with v1, under review by TNNLS
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[108]  arXiv:2003.11476 (replaced) [pdf, other]
Title: PiP: Planning-informed Trajectory Prediction for Autonomous Driving
Comments: European Conference on Computer Vision (ECCV) 2020; Project page at this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[109]  arXiv:2003.12673 (replaced) [pdf, other]
Title: Semantic Implicit Neural Scene Representations With Semi-Supervised Training
Comments: 3DV 2020 Camera Ready this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[110]  arXiv:2004.14557 (replaced) [pdf, other]
Title: Learning Deformable Image Registration from Optimization: Perspective, Modules, Bilevel Training and Beyond
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[111]  arXiv:2006.01031 (replaced) [pdf, other]
Title: A Smooth Representation of Belief over SO(3) for Deep Rotation Learning with Uncertainty
Comments: In Proceedings of Robotics: Science and Systems (RSS'20), Corvallis , Oregon, USA, Jul. 12-16, 2020
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[112]  arXiv:2006.06119 (replaced) [pdf, other]
Title: Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning
Comments: Accepted by ICLR 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[113]  arXiv:2006.06969 (replaced) [pdf, other]
Title: Multi Layer Neural Networks as Replacement for Pooling Operations
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[114]  arXiv:2006.08315 (replaced) [pdf, other]
Title: Mitigating Gender Bias in Captioning Systems
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[115]  arXiv:2006.12634 (replaced) [pdf, other]
Title: RP2K: A Large-Scale Retail Product Dataset for Fine-Grained Image Classification
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[116]  arXiv:2006.13144 (replaced) [pdf, other]
Title: Calibrated Adversarial Refinement for Stochastic Semantic Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[117]  arXiv:2007.10538 (replaced) [pdf, other]
Title: Regularizing Deep Networks with Semantic Data Augmentation
Comments: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI). Journal version of arXiv:1909.12220 (NeurIPS 2019). Code is available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[118]  arXiv:2007.13640 (replaced) [pdf, other]
Title: Solving Linear Inverse Problems Using the Prior Implicit in a Denoiser
Comments: 17 pages, 13 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
[119]  arXiv:2008.04149 (replaced) [pdf, other]
Title: Deep Sketch-guided Cartoon Video Inbetweening
Comments: 15 pages, 16 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[120]  arXiv:2009.08012 (replaced) [pdf, other]
Title: Deep Momentum Uncertainty Hashing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[121]  arXiv:2009.09399 (replaced) [pdf, other]
Title: DVG-Face: Dual Variational Generation for Heterogeneous Face Recognition
Comments: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[122]  arXiv:2010.00866 (replaced) [pdf, other]
Title: Weight and Gradient Centralization in Deep Neural Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[123]  arXiv:2010.00873 (replaced) [pdf, other]
Title: Rotated Ring, Radial and Depth Wise Separable Radial Convolutions
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[124]  arXiv:2010.07367 (replaced) [pdf, other]
Title: Pose Refinement Graph Convolutional Network for Skeleton-based Action Recognition
Comments: Accepted for publication in IEEE Robotics and Automation Letters (RA-L)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[125]  arXiv:2010.07411 (replaced) [pdf, other]
Title: Harnessing Uncertainty in Domain Adaptation for MRI Prostate Lesion Segmentation
Comments: Accepted at MICCAI 2020. Code is available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[126]  arXiv:2010.11757 (replaced) [pdf, ps, other]
Title: Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition
Comments: Codes and models are available on this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[127]  arXiv:2010.13753 (replaced) [pdf, other]
Title: Handgun detection using combined human pose and weapon appearance
Comments: 27 pages, 17 figures; typos corrected, references added, revised explanations in sections 1, 2 and 4, results unchanged
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[128]  arXiv:2011.00301 (replaced) [pdf, other]
Title: PREGAN: Pose Randomization and Estimation for Weakly Paired Image Style Translation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[129]  arXiv:2012.07403 (replaced) [pdf]
Title: One-Shot Learning with Triplet Loss for Vegetation Classification Tasks
Authors: Alexander Uzhinskiy (1), Gennady Ososkov (1), Pavel Goncharov (1), Andrey Nechaevskiy (1), Artem Smetanin (2) ((1) Joint Institute for Nuclear Research, Dubna, Moscow region, Russia, (2) ITMO University, Saint Petersburg, Russia)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[130]  arXiv:2012.13689 (replaced) [pdf, other]
Title: Dual-Refinement: Joint Label and Feature Refinement for Unsupervised Domain Adaptive Person Re-Identification
Comments: 14 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[131]  arXiv:2101.03793 (replaced) [pdf, other]
Title: The Gaze and Mouse Signal as additional Source for User Fingerprints in Browser Applications
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[132]  arXiv:2101.04240 (replaced) [pdf, other]
Title: Lesion2Vec: Deep Metric Learning for Few-Shot Multiple Lesions Recognition in Wireless Capsule Endoscopy Video
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[133]  arXiv:2101.04544 (replaced) [pdf, other]
Title: Resolution-invariant Person ReID Based on Feature Transformation and Self-weighted Attention
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[134]  arXiv:2101.05361 (replaced) [pdf, other]
Title: Random Shadows and Highlights: A new data augmentation method for extreme lighting conditions
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[135]  arXiv:2101.05479 (replaced) [pdf, other]
Title: Understanding the Role of Scene Graphs in Visual Question Answering
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[136]  arXiv:2101.05616 (replaced) [pdf]
Title: Road Surface Translation Under Snow-covered and Semantic Segmentation for Snow Hazard Index
Comments: 9 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[137]  arXiv:2101.05913 (replaced) [pdf, other]
Title: Supervised Transfer Learning at Scale for Medical Imaging
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[138]  arXiv:1711.05407 (replaced) [pdf, other]
Title: MARGIN: Uncovering Deep Neural Networks using Graph Signal Analysis
Comments: Technical Report
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[139]  arXiv:1912.06295 (replaced) [pdf, other]
Title: A Practical Solution for SAR Despeckling With Adversarial Learning Generated Speckled-to-Speckled Images
Comments: 5 pages, 4 figures
Journal-ref: IEEE Geoscience and Remote Sensing Letters,(2020)1-5
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
[140]  arXiv:2006.08217 (replaced) [pdf, other]
Title: AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights
Comments: Accepted at ICLR 2021. First two authors contributed equally
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[141]  arXiv:2007.03669 (replaced) [pdf, other]
Title: See, Hear, Explore: Curiosity via Audio-Visual Association
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[142]  arXiv:2007.05597 (replaced) [pdf, other]
Title: EMIXER: End-to-end Multimodal X-ray Generation via Self-supervision
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[143]  arXiv:2007.07375 (replaced) [pdf, other]
Title: Concept Learners for Few-Shot Learning
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[144]  arXiv:2008.02516 (replaced) [pdf, other]
Title: FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire
Comments: Accepted by ACM MM 2020
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
[145]  arXiv:2009.09808 (replaced) [pdf, other]
Title: On the Effectiveness of Weight-Encoded Neural Implicit 3D Shapes
Subjects: Graphics (cs.GR); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
[146]  arXiv:2010.00821 (replaced) [pdf, other]
Title: Explainable Online Validation of Machine Learning Models for Practical Applications
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[147]  arXiv:2010.04767 (replaced) [pdf]
Title: Robust Behavioral Cloning for Autonomous Vehicles using End-to-End Imitation Learning
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
[148]  arXiv:2010.05045 (replaced) [pdf, other]
Title: Interpreting Multivariate Shapley Interactions in DNNs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[149]  arXiv:2010.09164 (replaced) [pdf, other]
Title: Evidential Sparsification of Multimodal Latent Spaces in Conditional Variational Autoencoders
Comments: 21 pages, 15 figures, 34th Conference on Neural Information Processing Systems (NeurIPS 2020)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[150]  arXiv:2010.13372 (replaced) [pdf, other]
Title: What is the best data augmentation for 3D brain tumor segmentation?
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[151]  arXiv:2010.14535 (replaced) [pdf, other]
Title: Neural Architecture Search of SPD Manifold Networks
Comments: Info: 20 pages, 11 Figures, and 10 Tables; Added extra experimental comparison
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
[152]  arXiv:2012.07079 (replaced) [pdf, other]
Title: CHS-Net: A Deep learning approach for hierarchical segmentation of COVID-19 infected CT images
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[153]  arXiv:2012.07176 (replaced) [pdf, other]
Title: Pseudo Shots: Few-Shot Learning with Auxiliary Data
Comments: Added link to code; Added acknowledgments
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[154]  arXiv:2012.09550 (replaced) [pdf, other]
Title: Learned Block-based Hybrid Image Compression
Comments: 9 pages, 11 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[155]  arXiv:2012.12535 (replaced) [pdf]
Title: StainNet: a fast and robust stain normalization network
Comments: 7 pages, 8 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[156]  arXiv:2101.03244 (replaced) [pdf, other]
Title: End-to-end Prostate Cancer Detection in bpMRI via 3D CNNs: Effect of Attention Mechanisms, Clinical Priori and Decoupled False Positive Reduction
Comments: Under Review at MedIA: Medical Image Analysis. This manuscript incorporates and expands upon our 2020 Medical Imaging Meets NeurIPS Workshop paper (arXiv:2011.00263)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[157]  arXiv:2101.03255 (replaced) [pdf, other]
Title: Good Students Play Big Lottery Better
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)