Computer Vision and Pattern Recognition

New submissions

New submissions for Thu, 22 Apr 21

[1]  arXiv:2104.10249 [pdf, other]
Title: Superpixels and Graph Convolutional Neural Networks for Efficient Detection of Nutrient Deficiency Stress from Aerial Imagery
Comments: 10 pages, 3 figures, 1 table, 1 algorithm
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Advances in remote sensing technology have led to the capture of massive amounts of data. Increased image resolution, more frequent revisit times, and additional spectral channels have created an explosion in the amount of data that is available to provide analyses and intelligence across domains, including agriculture. However, the processing of this data comes with a cost in terms of computation time and money, both of which must be considered when the goal of an algorithm is to provide real-time intelligence to improve efficiencies. Specifically, we seek to identify nutrient-deficient areas from remotely sensed data to alert farmers to regions that require attention; detection of nutrient-deficient areas is a key task in precision agriculture as farmers must quickly respond to struggling areas to protect their harvests. Past methods have focused on pixel-level classification (i.e. semantic segmentation) of the field to achieve these tasks, often using deep learning models with tens of millions of parameters. In contrast, we propose a much lighter graph-based method to perform node-based classification. We first use Simple Linear Iterative Clustering (SLIC) to produce superpixels across the field. Then, to perform segmentation across the non-Euclidean domain of superpixels, we leverage a Graph Convolutional Neural Network (GCN). This model has four orders of magnitude fewer parameters than a CNN model and trains in a matter of minutes.
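
The node-classification idea above can be illustrated with a single graph-convolution step over a tiny superpixel adjacency graph. This is only a minimal sketch: it assumes the widely used symmetric-normalised propagation rule ReLU(D^-1/2 (A + I) D^-1/2 X W), which the abstract does not confirm, and the graph, features, and weights are made-up toy values.

```python
import math

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    Assumption: this propagation rule is a common choice, not necessarily
    the one used in the paper.
    """
    n = len(adj)
    # Add self-loops so every node also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalisation of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    n_in, n_out = len(weight), len(weight[0])
    # Aggregate neighbour features, then apply the linear map and ReLU.
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n)) for f in range(n_in)] for i in range(n)]
    return [[max(0.0, sum(agg[i][f] * weight[f][o] for f in range(n_in))) for o in range(n_out)]
            for i in range(n)]

# Toy example (hypothetical): three superpixels in a chain 0 - 1 - 2,
# one feature per superpixel, identity weight.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0], [0.0], [0.0]]
out = gcn_layer(adj, feats, [[1.0]])
```

After one step the feature of superpixel 0 has spread to its neighbour 1 but not yet to superpixel 2, which is why stacking layers widens each node's receptive field.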

[2]  arXiv:2104.10252 [pdf, other]
Title: Revisiting The Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis
Comments: CVPR 2021 Workshop on Responsible Computer Vision
Subjects: Computer Vision and Pattern Recognition (cs.CV)

As the demand for deep learning solutions increases, the need for explainability becomes even more fundamental. In this setting, particular attention has been given to visualization techniques that try to attribute the right relevance to each input pixel with respect to the output of the network. In this paper, we focus on Class Activation Mapping (CAM) approaches, which provide an effective visualization by taking weighted averages of the activation maps. To enhance the evaluation and the reproducibility of such approaches, we propose a novel set of metrics to quantify explanation maps, which show better effectiveness and simplify comparisons between approaches. To evaluate the appropriateness of the proposal, we compare different CAM-based visualization methods on the entire ImageNet validation set, fostering proper comparisons and reproducibility.
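
To make the "weighted averages of the activation maps" concrete, here is a minimal sketch of how a CAM-style map can be assembled. The ReLU and peak normalisation are common post-processing choices assumed here for illustration; the activations and weights are toy values, not taken from the paper.

```python
def class_activation_map(activations, weights):
    """Combine K activation maps (each H x W) into one map using per-map weights."""
    h, w = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for a_k, w_k in zip(activations, weights):
        for y in range(h):
            for x in range(w):
                cam[y][x] += w_k * a_k[y][x]
    # Assumed post-processing: keep positive evidence and scale the peak to 1.
    cam = [[max(0.0, v) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam

# Two hypothetical 2x2 activation maps with class weights 0.5 and 1.0.
maps = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 2.0], [0.0, 0.0]]]
cam = class_activation_map(maps, [0.5, 1.0])
```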

[3]  arXiv:2104.10273 [pdf, other]
Title: Disentangled Face Identity Representations for joint 3D Face Recognition and Expression Neutralisation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we propose a new deep learning-based approach for disentangling face identity representations from expressive 3D faces. Given a 3D face, our approach not only extracts a disentangled identity representation but also generates a realistic 3D face with a neutral expression while predicting its identity. The proposed network consists of three components: (1) a Graph Convolutional Autoencoder (GCA) to encode the 3D faces into latent representations, (2) a Generative Adversarial Network (GAN) that translates the latent representations of expressive faces into those of neutral faces, and (3) an identity recognition sub-network taking advantage of the neutralized latent representations for 3D face recognition. The whole network is trained in an end-to-end manner. Experiments are conducted on three publicly available datasets showing the effectiveness of the proposed approach.

[4]  arXiv:2104.10278 [pdf, other]
Title: Compact and Effective Representations for Sketch-based Image Retrieval
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Sketch-based image retrieval (SBIR) has attracted increasing interest in the computer vision community, with high impact in real applications. For instance, SBIR brings an increased benefit to eCommerce search engines because it allows users to formulate a query just by drawing what they need to buy. However, current methods showing high precision in retrieval work in a high-dimensional space, which negatively affects aspects like memory consumption and processing time. Although some authors have also proposed compact representations, these drastically degrade the performance in a low dimension. Therefore, in this work, we present results from evaluating methods for producing compact embeddings in the context of sketch-based image retrieval. Our main interest is in strategies aiming to keep the local structure of the original space. The recent unsupervised local-topology preserving dimension reduction method UMAP fits our requirements and shows outstanding performance, improving even the precision achieved by SOTA methods. We evaluate six methods on two different datasets. We use Flickr15K and eCommerce datasets; the latter is another contribution of this work. We show that UMAP allows us to have feature vectors of 16 bytes improving precision by more than 35%.

[5]  arXiv:2104.10291 [pdf, other]
Title: Soft Expectation and Deep Maximization for Image Feature Detection
Comments: 9 pages, 3 figures, 2 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Central to the application of many multi-view geometry algorithms is the extraction of matching points between multiple viewpoints, enabling classical tasks such as camera pose estimation and 3D reconstruction. Over the decades, many approaches that characterize these points have been proposed based on hand-tuned appearance models and more recently data-driven learning methods. We propose SEDM, an iterative semi-supervised learning process that flips the question and first looks for repeatable 3D points, then trains a detector to localize them in image space. Our technique poses the problem as one of expectation maximization (EM), where the likelihood of the detector locating the 3D points is the objective function to be maximized. We utilize the geometry of the scene to refine the estimates of the location of these 3D points and produce a new pseudo ground truth during the expectation step, then train a detector to predict this pseudo ground truth in the maximization step. We apply our detector to standard benchmarks in visual localization, sparse 3D reconstruction, and mean matching accuracy. Our results show that this new model trained using SEDM is able to better localize the underlying 3D points in a scene, improving mean SfM quality by $-0.15\pm0.11$ mean reprojection error when compared to SuperPoint or $-0.38\pm0.23$ when compared to R2D2.

[6]  arXiv:2104.10325 [pdf, other]
Title: SRWarp: Generalized Image Super-Resolution under Arbitrary Transformation
Comments: Accepted to CVPR 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep CNNs have achieved significant successes in image processing and its applications, including single image super-resolution (SR). However, conventional methods still resort to some predetermined integer scaling factors, e.g., x2 or x4. Thus, they are difficult to apply when arbitrary target resolutions are required. Recent approaches extend the scope to real-valued upsampling factors, even with varying aspect ratios, to handle this limitation. In this paper, we propose the SRWarp framework to further generalize the SR tasks toward an arbitrary image transformation. We interpret the traditional image warping task, specifically when the input is enlarged, as a spatially-varying SR problem. We also propose several novel formulations, including the adaptive warping layer and multiscale blending, to reconstruct visually favorable results in the transformation process. Compared with previous methods, we do not constrain the SR model on a regular grid but allow numerous possible deformations for flexible and diverse image editing. Extensive experiments and ablation studies justify the necessity and demonstrate the advantage of the proposed SRWarp method under various transformations.

[7]  arXiv:2104.10330 [pdf, other]
Title: Boundary-Aware 3D Object Detection from Point Clouds
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Existing state-of-the-art 3D object detectors follow a two-stage paradigm. These methods typically comprise two steps: 1) Utilize a region proposal network to propose a fraction of high-quality proposals in a bottom-up fashion. 2) Resize and pool the semantic features from the proposed regions to summarize RoI-wise representations for further refinement. Note that the RoI-wise representations in step 2) are each treated as an uncorrelated entry when fed to the following detection headers. Nevertheless, we observe that the proposals generated by step 1) are somewhat offset from the ground truth and emerge densely in local neighborhoods with an underlying probability. Challenges arise when a proposal largely forsakes its boundary information due to coordinate offset while existing networks lack a corresponding information compensation mechanism. In this paper, we propose BANet for 3D object detection from point clouds. Specifically, instead of refining each proposal independently as previous works do, we represent each proposal as a node for graph construction within a given cut-off threshold, associating proposals in the form of a local neighborhood graph, with the boundary correlations of an object being explicitly exploited. Besides, we devise a lightweight Region Feature Aggregation Network to fully exploit voxel-wise, pixel-wise, and point-wise features with expanding receptive fields for more informative RoI-wise representations. As of Apr. 17th, 2021, our BANet achieves on-par performance on the KITTI 3D detection leaderboard and ranks $1^{st}$ on the $Moderate$ difficulty of the $Car$ category on the KITTI BEV detection leaderboard. The source code will be released once the paper is accepted.
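
The graph-construction step described above can be pictured with a small sketch that links proposals lying within a cut-off distance of each other. Using the Euclidean distance between proposal centers as the criterion is an assumption made here for illustration; the abstract does not state how the threshold is applied, and the coordinates below are invented.

```python
import math

def build_proposal_graph(centers, cutoff):
    """Return an adjacency list linking proposals whose centers are within `cutoff`.

    Assumption: distance between box centers decides connectivity.
    """
    edges = {i: [] for i in range(len(centers))}
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) <= cutoff:
                edges[i].append(j)
                edges[j].append(i)
    return edges

# Three hypothetical proposal centers (x, y, z) in metres.
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
graph = build_proposal_graph(centers, cutoff=1.5)
```

Proposals 0 and 1 end up connected and can exchange boundary information, while the distant proposal 2 stays isolated.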

[8]  arXiv:2104.10338 [pdf, other]
Title: Shadow Generation for Composite Image in Real-world Scenes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Image composition aims to insert a foreground object into a background image. Most previous image composition methods focus on adjusting the foreground to make it compatible with the background while ignoring the shadow effect of the foreground on the background. In this work, we focus on generating a plausible shadow for the foreground object in the composite image. First, we contribute a real-world shadow generation dataset DESOBA by generating synthetic composite images based on paired real images and deshadowed images. Then, we propose a novel shadow generation network SGRNet, which consists of a shadow mask prediction stage and a shadow filling stage. In the shadow mask prediction stage, foreground and background information interact thoroughly to generate the foreground shadow mask. In the shadow filling stage, shadow parameters are predicted to fill the shadow area. Extensive experiments on our DESOBA dataset and real composite images demonstrate the effectiveness of our proposed method.

[9]  arXiv:2104.10345 [pdf, other]
Title: Measuring economic activity from space: a case study using flying airplanes and COVID-19
Comments: 11 pages, 11 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

This work introduces a novel solution to measure economic activity through remote sensing for a wide range of spatial areas. We hypothesized that disturbances in human behavior caused by major life-changing events leave signatures in satellite imagery that allow devising relevant image-based indicators to estimate their impacts and support decision-makers. We present a case study for the COVID-19 coronavirus outbreak, which imposed severe mobility restrictions and caused worldwide disruptions, using flying airplane detection around the 30 busiest airports in Europe to quantify and analyze the lockdown's effects and post-lockdown recovery. Our solution won the Rapid Action Coronavirus Earth observation (RACE) upscaling challenge, sponsored by the European Space Agency and the European Commission, and is now part of the RACE dashboard. This platform combines satellite data and artificial intelligence to promote a progressive and safe reopening of essential activities. Code and CNN models are available at https://github.com/maups/covid19-custom-script-contest

[10]  arXiv:2104.10351 [pdf, other]
Title: Improving Weakly-supervised Object Localization via Causal Intervention
Comments: 11 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The recently emerged weakly supervised object localization (WSOL) methods can learn to localize an object in the image using only image-level labels. Previous works endeavor to perceive the interval objects from the small and sparse discriminative attention map, yet ignore the co-occurrence confounder (e.g., bird and sky), which makes it hard for the model inspection (e.g., CAM) to distinguish between the object and context. In this paper, we make an early attempt to tackle this challenge via causal intervention (CI). Our proposed method, dubbed CI-CAM, explores the causalities among images, contexts, and categories to eliminate the biased co-occurrence in the class activation maps, thus improving the accuracy of object localization. Extensive experiments on several benchmarks demonstrate the effectiveness of CI-CAM in learning clear object boundaries from confounding contexts. Particularly, on CUB-200-2011, which severely suffers from the co-occurrence confounder, CI-CAM significantly outperforms the traditional CAM-based baseline (58.39% vs 52.4% in top-1 localization accuracy). In more general scenarios such as ImageNet, CI-CAM also performs on par with the state of the art.

[11]  arXiv:2104.10355 [pdf, other]
Title: Revisiting Document Representations for Large-Scale Zero-Shot Learning
Comments: Accepted to NAACL 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Zero-shot learning aims to recognize unseen objects using their semantic representations. Most existing works use visual attributes labeled by humans, not suitable for large-scale applications. In this paper, we revisit the use of documents as semantic representations. We argue that documents like Wikipedia pages contain rich visual information, which however can easily be buried by the vast amount of non-visual sentences. To address this issue, we propose a semi-automatic mechanism for visual sentence extraction that leverages the document section headers and the clustering structure of visual sentences. The extracted visual sentences, after a novel weighting scheme to distinguish similar classes, essentially form semantic representations like visual attributes but need much less human effort. On the ImageNet dataset with over 10,000 unseen classes, our representations lead to a 64% relative improvement against the commonly used ones.

[12]  arXiv:2104.10369 [pdf, other]
Title: Improvement of Normal Estimation for PointClouds via Simplifying Surface Fitting
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

With the rapid development of neural networks in recent years, the task of normal estimation has once again attracted attention. By introducing neural networks to classic methods based on problem-specific knowledge, the adaptability of the normal estimation algorithm to noise and scale has been greatly improved. However, the compatibility between neural networks and the traditional methods has not been considered. In the spirit of Occam's razor, that is, simpler is better, we observe that a more simplified process of surface fitting can significantly improve the accuracy of the normal estimation. In this paper, two simple-yet-effective strategies are proposed to address the compatibility between the neural networks and the surface fitting process to improve normal estimation. Firstly, a dynamic top-k selection strategy is introduced to better focus on the most critical points of a given patch, and the points selected by our learning method tend to fit a surface by way of a simple tangent plane, which can dramatically improve the normal estimation results of patches with sharp corners or complex patterns. Then, we propose a point update strategy before local surface fitting, which smooths the sharp boundary of the patch to simplify the surface fitting process, significantly reducing the fitting distortion and improving the accuracy of the predicted point normal. The experiments analyze the effectiveness of our proposed strategies and demonstrate that our method achieves SOTA results with the advantage of higher estimation accuracy over most existing approaches.

[13]  arXiv:2104.10376 [pdf, other]
Title: Towards Corruption-Agnostic Robust Domain Adaptation
Comments: The first literature to investigate the topic of corruption-agnostic robust domain adaptation, a new practical and challenging domain adaptation setting
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Great progress has been achieved in domain adaptation over the past decades. Existing works are always based on an ideal assumption that testing target domains are i.i.d. with training target domains. However, due to unpredictable corruptions (e.g., noise and blur) in real data like web images, domain adaptation methods are increasingly required to be corruption robust on target domains. In this paper, we investigate a new task, Corruption-agnostic Robust Domain Adaptation (CRDA): to be accurate on original data and robust against unavailable-for-training corruptions on target domains. This task is non-trivial due to large domain discrepancy and unsupervised target domains. We observe that simple combinations of popular methods of domain adaptation and corruption robustness have sub-optimal CRDA results. We propose a new approach based on two technical insights into CRDA: 1) an easy-to-plug module called Domain Discrepancy Generator (DDG) that generates samples that enlarge domain discrepancy to mimic unpredictable corruptions; 2) a simple but effective teacher-student scheme with contrastive loss to enhance the constraints on target domains. Experiments verify that DDG keeps or even improves performance on original data and achieves better corruption robustness than baselines.

[14]  arXiv:2104.10386 [pdf, other]
Title: Guided Interactive Video Object Segmentation Using Reliability-Based Attention Maps
Comments: accepted to CVPR2021 (oral)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We propose a novel guided interactive segmentation (GIS) algorithm for video objects to improve the segmentation accuracy and reduce the interaction time. First, we design the reliability-based attention module to analyze the reliability of multiple annotated frames. Second, we develop the intersection-aware propagation module to propagate segmentation results to neighboring frames. Third, we introduce the GIS mechanism for a user to select unsatisfactory frames quickly with less effort. Experimental results demonstrate that the proposed algorithm provides more accurate segmentation results at a faster speed than conventional algorithms. Codes are available at https://github.com/yuk6heo/GIS-RAmap.

[15]  arXiv:2104.10401 [pdf, ps, other]
Title: Multi-Attention-Based Soft Partition Network for Vehicle Re-Identification
Comments: 10 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Vehicle re-identification (Re-ID) distinguishes between the same vehicle and other vehicles in images. It is challenging due to significant intra-instance differences between identical vehicles from different views and subtle inter-instance differences of similar vehicles. Researchers have tried to address this problem by extracting features robust to variations of viewpoints and environments. More recently, they tried to improve performance by using additional metadata such as key points, orientation, and temporal information. Although these attempts have been relatively successful, they all require expensive annotations. Therefore, this paper proposes a novel deep neural network called a multi-attention-based soft partition (MUSP) network to solve this problem. This network does not use metadata and only uses multiple soft attentions to identify a specific vehicle area, a function that was performed by metadata in previous studies. Experiments verified that MUSP achieved state-of-the-art (SOTA) performance on the VehicleID dataset without any additional annotations and comparable performance on VeRi-776 and VERI-Wild.

[16]  arXiv:2104.10406 [pdf, other]
Title: Discrete-continuous Action Space Policy Gradient-based Attention for Image-Text Matching
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Image-text matching is an important multi-modal task with massive applications. It tries to match the image and the text with similar semantic information. Existing approaches do not explicitly transform the different modalities into a common space. Meanwhile, the attention mechanism which is widely used in image-text matching models does not have supervision. We propose a novel attention scheme which projects the image and text embedding into a common space and optimises the attention weights directly towards the evaluation metrics. The proposed attention scheme can be considered a kind of supervised attention and requires no additional annotations. It is trained via a novel discrete-continuous action space policy gradient algorithm, which is more effective in modelling complex action spaces than the previous continuous action space policy gradient. We evaluate the proposed methods on two widely-used benchmark datasets: Flickr30k and MS-COCO, outperforming the previous approaches by a large margin.

[17]  arXiv:2104.10412 [pdf, other]
Title: Comprehensive Multi-Modal Interactions for Referring Image Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the given natural language description. To solve RIS efficiently, we need to understand each word's relationship with other words, each region in the image to other regions, and cross-modal alignment between linguistic and visual domains. Recent methods model these three types of interactions sequentially. We argue that such a modular approach limits these methods' performance, and joint simultaneous reasoning can help resolve ambiguities. To this end, we propose a Joint Reasoning (JRM) module and a novel Cross-Modal Multi-Level Fusion (CMMLF) module for tackling this task. JRM effectively models the referent's multi-modal context by jointly reasoning over visual and linguistic modalities (performing word-word, image region-region, word-region interactions in a single module). CMMLF module further refines the segmentation masks by exchanging contextual information across visual hierarchy through linguistic features acting as a bridge. We present thorough ablation studies and validate our approach's performance on four benchmark datasets, and show that the proposed method outperforms the existing state-of-the-art methods on all four datasets by significant margins.

[18]  arXiv:2104.10414 [pdf, other]
Title: Orderly Dual-Teacher Knowledge Distillation for Lightweight Human Pose Estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Although deep convolutional neural networks (DCNN) have achieved excellent performance in human pose estimation, these networks often have a large number of parameters and computations, leading to slow inference speed. For this issue, an effective solution is knowledge distillation, which transfers knowledge from a large pre-trained network (teacher) to a small network (student). However, there are some defects in the existing approaches: (I) Only a single teacher is adopted, neglecting the potential that a student can learn from multiple teachers. (II) The human segmentation mask can be regarded as additional prior information to restrict the location of keypoints, which is never utilized. (III) A student with a small number of parameters cannot fully imitate heatmaps provided by datasets and teachers. (IV) There exists noise in heatmaps generated by teachers, which causes model degradation. To overcome these defects, we propose an orderly dual-teacher knowledge distillation (ODKD) framework, which consists of two teachers with different capabilities. Specifically, the weaker one (primary teacher, PT) is used to teach keypoints information, while the stronger one (senior teacher, ST) is utilized to transfer segmentation and keypoints information by adding the human segmentation mask. Taking the two teachers together, an orderly learning strategy is proposed to promote knowledge absorbability. Moreover, we employ a binarization operation which further improves the learning ability of the student and reduces noise in heatmaps. Experimental results on COCO and OCHuman keypoints datasets show that our proposed ODKD can improve the performance of different lightweight models by a large margin, and HRNet-W16 equipped with ODKD achieves state-of-the-art performance for lightweight human pose estimation.

[19]  arXiv:2104.10419 [pdf, other]
Title: PP-YOLOv2: A Practical Object Detector
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Being effective and efficient is essential to an object detector for practical use. To meet these two concerns, we comprehensively evaluate a collection of existing refinements to improve the performance of PP-YOLO while keeping the inference time almost unchanged. This paper will analyze a collection of refinements and empirically evaluate their impact on the final model performance through an incremental ablation study. Things we tried that didn't work will also be discussed. By combining multiple effective refinements, we boost PP-YOLO's performance from 45.9% mAP to 49.5% mAP on COCO2017 test-dev. Since a significant performance margin has been achieved, we present PP-YOLOv2. In terms of speed, PP-YOLOv2 runs at 68.9 FPS at 640x640 input size. The Paddle inference engine with TensorRT, FP16 precision, and batch size = 1 further improves PP-YOLOv2's inference speed to 106.5 FPS. Such performance surpasses existing object detectors with roughly the same number of parameters (i.e., YOLOv4-CSP, YOLOv5l). Besides, PP-YOLOv2 with ResNet101 achieves 50.3% mAP on COCO2017 test-dev. Source code is at https://github.com/PaddlePaddle/PaddleDetection.
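
The throughput and accuracy figures quoted above are easier to compare as per-image latency and absolute gains. The short calculation below only rearranges numbers from the abstract; it assumes latency is simply the reciprocal of the reported FPS.

```python
def latency_ms(fps):
    """Per-image latency in milliseconds, assuming latency = 1 / throughput."""
    return 1000.0 / fps

base_ms = round(latency_ms(68.9), 2)    # 14.51 ms at 640x640
trt_ms = round(latency_ms(106.5), 2)    # 9.39 ms with TensorRT, FP16, batch size 1
speedup = round(106.5 / 68.9, 2)        # 1.55x faster with TensorRT
map_gain = round(49.5 - 45.9, 1)        # 3.6 mAP points over PP-YOLO
```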

[20]  arXiv:2104.10420 [pdf]
Title: Machine vision detection to daily facial fatigue with a nonlocal 3D attention network
Comments: 25 pages, 6 figures, 5 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Fatigue detection is valuable for maintaining mental health and preventing safety accidents. However, detecting facial fatigue, especially mild fatigue, in the real world via machine vision is still a challenging issue due to the lack of non-lab datasets and well-defined algorithms. In order to improve the detection capability on facial fatigue so that it can be used widely in daily life, this paper provides an audiovisual dataset named DLFD (daily-life fatigue dataset) which reflects people's facial fatigue state in the wild. A framework using 3D-ResNet along with a non-local attention mechanism was trained to extract local and long-range features in spatial and temporal dimensions. Then, a compact loss function combining mean squared error and cross-entropy was designed to predict both continuous and categorical fatigue degrees. Our proposed framework has reached an average accuracy of 90.8% on the validation set and 72.5% on the test set for binary classification, which compares well with other state-of-the-art methods. The analysis of feature map visualization revealed that our framework captured facial dynamics and attempted to build a connection with fatigue state. Our experimental results in multiple metrics showed that our framework captured some typical, micro and dynamic facial features along spatiotemporal dimensions, contributing to mild fatigue detection in the wild.
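
A loss that combines mean squared error with cross-entropy, as described above, can be sketched for a single sample as follows. The mixing weight `alpha` and the convex-combination form are hypothetical choices for illustration; the abstract does not say how the two terms are balanced.

```python
import math

def combined_loss(pred_score, true_score, class_logits, true_class, alpha=0.5):
    """Weighted sum of a regression term (MSE) and a classification term (cross-entropy).

    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    mse = (pred_score - true_score) ** 2
    # Numerically stable log-sum-exp for the softmax normaliser.
    m = max(class_logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in class_logits))
    cross_entropy = log_z - class_logits[true_class]
    return alpha * mse + (1 - alpha) * cross_entropy

# Perfect regression but an uninformative two-class prediction.
loss = combined_loss(0.5, 0.5, [0.0, 0.0], true_class=1)
```

In this example the regression term vanishes, so the loss is half the cross-entropy of a uniform two-class guess, i.e. 0.5 * ln 2.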

[21]  arXiv:2104.10442 [pdf, other]
Title: Fourier Contour Embedding for Arbitrary-Shaped Text Detection
Comments: Accepted by CVPR 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

One of the main challenges for arbitrary-shaped text detection is to design a good text instance representation that allows networks to learn diverse text geometry variances. Most existing methods model text instances in the image spatial domain via masks or contour point sequences in the Cartesian or the polar coordinate system. However, the mask representation might lead to expensive post-processing, while the point sequence one may have limited capability to model texts with highly-curved shapes. To tackle these problems, we model text instances in the Fourier domain and propose a novel Fourier Contour Embedding (FCE) method to represent arbitrary-shaped text contours as compact signatures. We further construct FCENet with a backbone, feature pyramid networks (FPN) and a simple post-processing with the Inverse Fourier Transformation (IFT) and Non-Maximum Suppression (NMS). Different from previous methods, FCENet first predicts compact Fourier signatures of text instances, and then reconstructs text contours via IFT and NMS during testing. Extensive experiments demonstrate that FCE is accurate and robust in fitting contours of scene texts even with highly-curved shapes, and also validate the effectiveness and the good generalization of FCENet for arbitrary-shaped text detection. Furthermore, experimental results show that our FCENet is superior to the state-of-the-art (SOTA) methods on CTW1500 and Total-Text, especially on the challenging highly-curved text subset.
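
The idea of describing a closed contour by a handful of Fourier coefficients and recovering it with an inverse transform can be sketched as below. This is a generic illustration that treats contour points as complex numbers and keeps frequencies -k..k; it is an assumption about the general technique, not the paper's exact formulation, and the circular contour is a toy input.

```python
import cmath

def fourier_signature(points, k):
    """Fourier coefficients c_{-k}..c_{k} of a closed contour given as (x, y) samples."""
    n = len(points)
    z = [complex(x, y) for x, y in points]
    return {f: sum(z[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / n
            for f in range(-k, k + 1)}

def reconstruct(signature, n_samples):
    """Inverse transform: sample contour points back from the compact signature."""
    contour = []
    for t in range(n_samples):
        p = sum(c * cmath.exp(2j * cmath.pi * f * t / n_samples)
                for f, c in signature.items())
        contour.append((p.real, p.imag))
    return contour

# Toy contour: a circle of radius 2 centred at (3, 4), sampled at 16 points.
circle = []
for t in range(16):
    p = complex(3, 4) + 2 * cmath.exp(2j * cmath.pi * t / 16)
    circle.append((p.real, p.imag))

signature = fourier_signature(circle, k=1)   # only 3 complex numbers
recovered = reconstruct(signature, 16)
```

For this circle the frequency-0 coefficient encodes the center and the frequency-1 coefficient encodes the radius, so three coefficients reproduce all 16 points; more irregular contours need a larger `k`.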

[22]  arXiv:2104.10447 [pdf, other]
Title: A Meta-Learning Approach for Medical Image Registration
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Non-rigid registration is a necessary but challenging task in medical imaging studies. Recently, unsupervised registration models have shown good performance, but they often require a large-scale training dataset and long training times. Therefore, in real-world applications where only dozens to hundreds of image pairs are available, existing models cannot be practically used. To address these limitations, we propose a novel unsupervised registration model which is integrated with a gradient-based meta learning framework. In particular, we train a meta learner which finds an initialization point of parameters by utilizing a variety of existing registration datasets. To quickly adapt to various tasks, the meta learner is updated to get close to the center of the parameters which are fine-tuned for each registration task. Thereby, our model can adapt to unseen domain tasks via a short fine-tuning process and perform accurate registration. To verify the superiority of our model, we train the model for various 2D medical image registration tasks such as retinal choroid Optical Coherence Tomography Angiography (OCTA), CT organs, and brain MRI scans and test on registration of retinal OCTA Superficial Capillary Plexus (SCP). In our experiments, the proposed model obtained significantly improved performance in terms of accuracy and training time compared to other registration models.
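
The update that moves the meta learner "close to the center of the parameters" fine-tuned per task can be pictured with a simple interpolation step. This sketch assumes a Reptile-style rule with a hypothetical step size `step`; the abstract does not specify the actual update, and the parameter vectors are toy values.

```python
def meta_update(theta, task_params, step=0.5):
    """Move meta-parameters toward the mean of the task-specific fine-tuned parameters.

    Assumption: a Reptile-style interpolation; `step` is a hypothetical step size.
    """
    n_tasks = len(task_params)
    center = [sum(p[i] for p in task_params) / n_tasks for i in range(len(theta))]
    return [t + step * (c - t) for t, c in zip(theta, center)]

# Two hypothetical registration tasks, each yielding fine-tuned parameters.
theta = [0.0, 0.0]
fine_tuned = [[1.0, 2.0], [3.0, 2.0]]
theta_new = meta_update(theta, fine_tuned, step=0.5)
```

Here the center of the two fine-tuned vectors is (2, 2), and the initialization moves halfway toward it.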

[23]  arXiv:2104.10475 [pdf, other]
Title: Camouflaged Object Segmentation with Distraction Mining
Comments: CVPR 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Camouflaged object segmentation (COS) aims to identify objects that are "perfectly" assimilated into their surroundings, which has a wide range of valuable applications. The key challenge of COS is that there exist high intrinsic similarities between the candidate objects and the noise background. In this paper, we strive to embrace challenges towards effective and efficient COS. To this end, we develop a bio-inspired framework, termed Positioning and Focus Network (PFNet), which mimics the process of predation in nature. Specifically, our PFNet contains two key modules, i.e., the positioning module (PM) and the focus module (FM). The PM is designed to mimic the detection process in predation for positioning the potential target objects from a global perspective, and the FM is then used to perform the identification process in predation for progressively refining the coarse prediction via focusing on the ambiguous regions. Notably, in the FM, we develop a novel distraction mining strategy for distraction discovery and removal, to benefit the performance of estimation. Extensive experiments demonstrate that our PFNet runs in real-time (72 FPS) and significantly outperforms 18 cutting-edge models on three challenging datasets under four standard metrics.

[24]  arXiv:2104.10481 [pdf, other]
Title: SSLM: Self-Supervised Learning for Medical Diagnosis from MR Video
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

In medical image analysis, the cost of acquiring high-quality data and their annotation by experts is a barrier in many medical applications. Most of the techniques used are based on supervised learning framework and need a large amount of annotated data to achieve satisfactory performance. As an alternative, in this paper, we propose a self-supervised learning approach to learn the spatial anatomical representations from the frames of magnetic resonance (MR) video clips for the diagnosis of knee medical conditions. The pretext model learns meaningful spatial context-invariant representations. The downstream task in our paper is a class imbalanced multi-label classification. Different experiments show that the features learnt by the pretext model provide explainable performance in the downstream task. Moreover, the efficiency and reliability of the proposed pretext model in learning representations of minority classes without applying any strategy towards imbalance in the dataset can be seen from the results. To the best of our knowledge, this work is the first work of its kind in showing the effectiveness and reliability of self-supervised learning algorithms in class imbalanced multi-label classification tasks on MR video.
The code for evaluation of the proposed work is available at https://github.com/anonymous-cvpr/sslm

[25]  arXiv:2104.10490 [pdf, other]
Title: FIERY: Future Instance Prediction in Bird's-Eye View from Surround Monocular Cameras
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Driving requires interacting with road agents and predicting their future behaviour in order to navigate safely. We present FIERY: a probabilistic future prediction model in bird's-eye view from monocular cameras. Our model predicts future instance segmentation and motion of dynamic agents that can be transformed into non-parametric future trajectories. Our approach combines the perception, sensor fusion and prediction components of a traditional autonomous driving stack by estimating bird's-eye-view prediction directly from surround RGB monocular camera inputs. FIERY learns to model the inherent stochastic nature of the future directly from camera driving data in an end-to-end manner, without relying on HD maps, and predicts multimodal future trajectories. We show that our model outperforms previous prediction baselines on the NuScenes and Lyft datasets. Code is available at https://github.com/wayveai/fiery

[26]  arXiv:2104.10492 [pdf, other]
Title: Skimming and Scanning for Untrimmed Video Action Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Video action recognition (VAR) is a primary task of video understanding, and untrimmed videos are more common in real-life scenes. Untrimmed videos have redundant and diverse clips containing contextual information, so sampling dense clips is essential. Recently, some works have attempted to train a generic model to select the N most representative clips. However, it is difficult to model the complex relations among intra-class clips and inter-class videos within a single model and with a fixed number of selected clips, and the entanglement of multiple relations is also hard to explain. Thus, instead of "only look once", we argue that a "divide and conquer" strategy is more suitable for untrimmed VAR. Inspired by the speed reading mechanism, we propose a simple yet effective clip-level solution based on skim-scan techniques. Specifically, the proposed Skim-Scan framework first skims the entire video and drops uninformative and misleading clips. For the remaining clips, it gradually scans clips with diverse features to drop redundant clips while still covering the essential content. These strategies adaptively select the necessary clips according to the difficulty of each video. To trade off computational complexity against performance, we observe that lightweight and heavy networks exhibit similar statistical expression, which supports exploring a combination of the two. Comprehensive experiments are performed on the ActivityNet and mini-FCVID datasets, and the results demonstrate that our solution surpasses the state-of-the-art performance in terms of both accuracy and efficiency.

[27]  arXiv:2104.10510 [pdf, other]
Title: Balanced Knowledge Distillation for Long-tailed Learning
Comments: 10 pages, 4 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep models trained on long-tailed datasets exhibit unsatisfactory performance on tail classes. Existing methods usually modify the classification loss to increase the learning focus on tail classes, which unexpectedly sacrifices the performance on head classes. In fact, this scheme leads to a contradiction between the two goals of long-tailed learning, i.e., learning generalizable representations and facilitating learning for tail classes. In this work, we explore knowledge distillation in long-tailed scenarios and propose a novel distillation framework, named Balanced Knowledge Distillation (BKD), to disentangle the contradiction between the two goals and achieve both simultaneously. Specifically, given a vanilla teacher model, we train the student model by minimizing the combination of an instance-balanced classification loss and a class-balanced distillation loss. The former benefits from the sample diversity and learns generalizable representations, while the latter considers the class priors and facilitates learning mainly for tail classes. The student model trained with BKD obtains a significant performance gain even compared with its teacher model. We conduct extensive experiments on several long-tailed benchmark datasets and demonstrate that the proposed BKD is an effective knowledge distillation framework in long-tailed scenarios, as well as a new state-of-the-art method for long-tailed learning. Code is available at https://github.com/EricZsy/BalancedKnowledgeDistillation.
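
The loss combination described in this abstract can be sketched for a single training example. The abstract does not specify the class weights, the temperature, or how the two terms are scaled, so all of those (and the function names) are assumptions here. Refer to the authors' repository for the actual formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def bkd_style_loss(student_logits, teacher_logits, label,
                   class_weights, temperature=2.0):
    """Instance-balanced cross-entropy plus a class-weighted distillation term.

    `class_weights` is assumed to be larger for tail classes, e.g. derived
    from inverse class frequency (a placeholder choice for this sketch).
    """
    # Plain cross-entropy on the ground-truth label: every instance counts equally.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-label cross-entropy against the teacher, re-weighted per class.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kd = -sum(
        w * t * math.log(s)
        for w, t, s in zip(class_weights, p_teacher, p_student)
    )
    return ce + temperature ** 2 * kd


uniform_loss = bkd_style_loss([0.0, 0.0], [0.0, 0.0], label=0,
                              class_weights=[1.0, 1.0], temperature=1.0)
tail_weighted_loss = bkd_style_loss([0.0, 0.0], [0.0, 0.0], label=0,
                                    class_weights=[1.0, 3.0], temperature=1.0)
```

Raising the weight of the second (tail) class increases the distillation term contributed by that class, which is the mechanism the abstract attributes to the class-balanced part of the loss.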

[28]  arXiv:2104.10511 [pdf, other]
Title: Hierarchical Convolutional Neural Network with Feature Preservation and Autotuned Thresholding for Crack Detection
Journal-ref: IEEE Access, 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Drone imagery is increasingly used in automated inspection for infrastructure surface defects, especially in hazardous or unreachable environments. In machine vision, the key to crack detection rests with robust and accurate algorithms for image processing. To this end, this paper proposes a deep learning approach using hierarchical convolutional neural networks with feature preservation (HCNNFP) and an intercontrast iterative thresholding algorithm for image binarization. First, a set of branch networks is proposed, wherein the output of previous convolutional blocks is half-sizedly concatenated to the current ones to reduce the obscuration in the down-sampling stage, taking into account the overall information loss. Next, to extract the feature map generated from the enhanced HCNN, a binary contrast-based autotuned thresholding (CBAT) approach is developed at the post-processing step, where patterns of interest are clustered within the probability map of the identified features. The proposed technique is then applied to identify cracks on the surfaces of roads, bridges or pavements. An extensive comparison with existing techniques is conducted on various datasets and subject to a number of evaluation criteria, including the average F-measure (AFβ) introduced here for dynamic quantification of the performance. In experiments on crack images, including those captured by unmanned aerial vehicles inspecting a monorail bridge, the proposed technique outperforms the existing methods on various tested datasets, especially on the GAPs dataset, with an increase of about 1.4% in terms of AFβ while the mean percentage error drops by 2.2%. Such performance demonstrates the merits of the proposed HCNNFP architecture for surface defect inspection.

[29]  arXiv:2104.10515 [pdf, other]
Title: Real-time dense 3D Reconstruction from monocular video data captured by low-cost UAVs
Comments: 8 pages, 4 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Real-time 3D reconstruction enables fast dense mapping of the environment which benefits numerous applications, such as navigation or live evaluation of an emergency. In contrast to most real-time capable approaches, our approach does not need an explicit depth sensor. Instead, we only rely on a video stream from a camera and its intrinsic calibration. By exploiting the self-motion of the unmanned aerial vehicle (UAV) flying with oblique view around buildings, we estimate both camera trajectory and depth for selected images with enough novel content. To create a 3D model of the scene, we rely on a three-stage processing chain. First, we estimate the rough camera trajectory using a simultaneous localization and mapping (SLAM) algorithm. Once a suitable constellation is found, we estimate depth for local bundles of images using a Multi-View Stereo (MVS) approach and then fuse this depth into a global surfel-based model. For our evaluation, we use 55 video sequences with diverse settings, consisting of both synthetic and real scenes. We evaluate not only the generated reconstruction but also the intermediate products and achieve competitive results both qualitatively and quantitatively. At the same time, our method can keep up with a 30 fps video for a resolution of 768x448 pixels.

[30]  arXiv:2104.10538 [pdf, other]
Title: Guided Table Structure Recognition through Anchor Optimization
Comments: 13 pages, 8 figures, 5 tables. Submitted to IEEE Access Journal
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper presents a novel approach to table structure recognition that leverages guided anchors. The concept differs from current state-of-the-art approaches for table structure recognition that naively apply object detection methods. In contrast to prior techniques, we first estimate the viable anchors for table structure recognition. Subsequently, these anchors are exploited to locate the rows and columns in tabular images. Furthermore, the paper introduces a simple and effective method that improves the results by using tabular layouts in realistic scenarios. The proposed method is exhaustively evaluated on two publicly available datasets for table structure recognition, i.e., ICDAR-2013 and TabStructDB. We achieved state-of-the-art results on the ICDAR-2013 dataset with an average F-Measure of 95.05% (94.6% for rows and 96.32% for columns) and surpassed the baseline results on the TabStructDB dataset with an average F-Measure of 94.17% (94.08% for rows and 95.06% for columns).

[31]  arXiv:2104.10563 [pdf, other]
Title: Photothermal-SR-Net: A Customized Deep Unfolding Neural Network for Photothermal Super Resolution Imaging
Comments: 10 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)

This paper presents deep unfolding neural networks to handle inverse problems in photothermal radiometry, enabling super resolution (SR) imaging. Photothermal imaging is a well-known technique in active thermography for nondestructive inspection of defects in materials such as metals or composites. A grand challenge of active thermography is to overcome the spatial resolution limitation imposed by heat diffusion in order to accurately resolve each defect. The photothermal SR approach enables the extraction of high-frequency spatial components based on deconvolution with the thermal point spread function. However, stable deconvolution can only be achieved by using the sparse structure of defect patterns, which often requires tedious, hand-crafted tuning of hyperparameters and results in computationally intensive algorithms. For this reason, Photothermal-SR-Net is proposed in this paper, which performs deconvolution by deep unfolding while considering the underlying physics. This enables super-resolving 2D thermal images for nondestructive testing with a substantially improved convergence rate. Since defects appear sparsely in materials, Photothermal-SR-Net applies trained block-sparsity thresholding to the acquired thermal images in each convolutional layer. The performance of the proposed approach is evaluated and discussed using various deep unfolding and thresholding approaches applied to 2D thermal images. Subsequently, studies are conducted on how to increase the reconstruction quality, and the computational performance of Photothermal-SR-Net is evaluated. It was found that the computing time for creating high-resolution images could be significantly reduced without decreasing the reconstruction quality by using pixel binning as a preprocessing step.

[32]  arXiv:2104.10567 [pdf, other]
Title: SOGAN: 3D-Aware Shadow and Occlusion Robust GAN for Makeup Transfer
Comments: 9 pages, 11 figures, 1 table
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

In recent years, virtual makeup applications have become more and more popular. However, it is still challenging to propose a robust makeup transfer method for real-world environments. Current makeup transfer methods mostly work well on well-conditioned, clean makeup images, but the results of transferring makeup that exhibits shadow and occlusion are not satisfying. To address this, we propose a novel makeup transfer method, called 3D-Aware Shadow and Occlusion Robust GAN (SOGAN). Given the source and the reference faces, we first fit a 3D face model and then disentangle the faces into shape and texture. In the texture branch, we map the texture to the UV space and design a UV texture generator to transfer the makeup. Since human faces are symmetrical in the UV space, we can conveniently remove the undesired shadow and occlusion from the reference image by carefully designing a Flip Attention Module (FAM). After obtaining cleaner makeup features from the reference image, a Makeup Transfer Module (MTM) is introduced to perform accurate makeup transfer. The qualitative and quantitative experiments demonstrate that our SOGAN not only achieves superior results in shadow and occlusion situations but also performs well under large pose and expression variations.

[33]  arXiv:2104.10588 [pdf, other]
Title: IB-DRR: Incremental Learning with Information-Back Discrete Representation Replay
Comments: CVPR 2021 Workshop on Continual Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Incremental learning aims to enable machine learning models to continuously acquire new knowledge given new classes, while maintaining the knowledge already learned for old classes. Saving a subset of training samples of previously seen classes in memory and replaying them during new training phases has proven to be an efficient and effective way to fulfil this aim. It is evident that the more exemplars the model inherits, the better the performance it can achieve. However, finding a trade-off between the model performance and the number of samples to save for each class is still an open problem for replay-based incremental learning and is increasingly desirable for real-life applications. In this paper, we approach this open problem by tapping into a two-step compression approach. The first step is a lossy compression: we propose to encode input images and save their discrete latent representations in the form of codes that are learned using a hierarchical Vector Quantised Variational Autoencoder (VQ-VAE). In the second step, we further compress the codes losslessly by learning a hierarchical latent variable model with bits-back asymmetric numeral systems (BB-ANS). To compensate for the information lost in the first compression step, we introduce an Information Back (IB) mechanism that utilizes real exemplars for a contrastive learning loss to regularize the training of a classifier. By maintaining all seen exemplars' representations in the format of `codes', Discrete Representation Replay (DRR) outperforms the state-of-the-art method on CIFAR-100 by a margin of 4% accuracy with a much lower memory cost for saving samples. When combined with IB and a small set of saved old raw exemplars, the accuracy of DRR can be further improved by 2%.

[34]  arXiv:2104.10602 [pdf, other]
Title: Visualizing Adapted Knowledge in Domain Transfer
Journal-ref: CVPR 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

A source model trained on source data and a target model learned through unsupervised domain adaptation (UDA) usually encode different knowledge. To understand the adaptation process, we portray their knowledge difference with image translation. Specifically, we feed a translated image and its original version to the two models respectively, formulating two branches. Through updating the translated image, we force similar outputs from the two branches. When such requirements are met, differences between the two images can compensate for and hence represent the knowledge difference between models. To enforce similar outputs from the two branches and depict the adapted knowledge, we propose a source-free image translation method that generates source-style images using only target images and the two models. We visualize the adapted knowledge on several datasets with different UDA methods and find that generated images successfully capture the style difference between the two domains. For application, we show that generated images enable further tuning of the target model without accessing source data. Code available at https://github.com/hou-yz/DA_visualization.

[35]  arXiv:2104.10609 [pdf, other]
Title: Lifting Monocular Events to 3D Human Poses
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper presents a novel 3D human pose estimation approach using a single stream of asynchronous events as input. Most state-of-the-art approaches solve this task with RGB cameras, but they struggle when subjects are moving fast. On the other hand, event-based 3D pose estimation benefits from the advantages of event cameras, especially their efficiency and robustness to appearance changes. Yet, finding human poses in asynchronous events is in general more challenging than standard RGB pose estimation, since few or no events are triggered in static scenes. Here we propose the first learning-based method for 3D human pose estimation from a single stream of events. Our method consists of two steps. First, we process the event-camera stream to predict three orthogonal heatmaps per joint; each heatmap is the projection of the joint onto one orthogonal plane. Next, we fuse the sets of heatmaps to estimate the 3D localisation of the body joints. As a further contribution, we make available a new, challenging dataset for event-based human pose estimation by simulating events from the RGB Human3.6m dataset. Experiments demonstrate that our method achieves solid accuracy, narrowing the performance gap between standard RGB and event-based vision. The code is freely available at https://iit-pavis.github.io/lifting_events_to_3d_hpe.
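
The second step described in this abstract, fusing three orthogonal heatmaps into a 3D joint position, can be illustrated with a deliberately naive sketch. The abstract does not describe the fusion procedure, so the argmax-and-average scheme, the plane names and index order, and the function names are all assumptions rather than the authors' method.

```python
def argmax2d(heatmap):
    """Return (row, col) of the largest value in a 2D list of floats."""
    _, row, col = max(
        (value, r, c)
        for r, line in enumerate(heatmap)
        for c, value in enumerate(line)
    )
    return row, col


def fuse_heatmaps(xy, xz, zy):
    """Estimate (x, y, z) from heatmaps on three orthogonal planes.

    Assumed index order: xy[y][x], xz[z][x], zy[y][z]. Each coordinate
    appears in two planes, so the two estimates are averaged.
    """
    y_a, x_a = argmax2d(xy)
    z_a, x_b = argmax2d(xz)
    y_b, z_b = argmax2d(zy)
    return ((x_a + x_b) / 2, (y_a + y_b) / 2, (z_a + z_b) / 2)


def one_hot(size, row, col):
    """Toy heatmap with a single peak, used only for this example."""
    return [
        [1.0 if (r, c) == (row, col) else 0.0 for c in range(size)]
        for r in range(size)
    ]


# A joint at x=1, y=2, z=3 projected onto the three planes.
joint = fuse_heatmaps(one_hot(4, 2, 1), one_hot(4, 3, 1), one_hot(4, 2, 3))
```

Averaging the two estimates of each coordinate is the simplest way to reconcile planes that disagree; a learned or confidence-weighted fusion would be a natural refinement.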

[36]  arXiv:2104.10615 [pdf, ps, other]
Title: Recurrent Feedback Improves Recognition of Partially Occluded Objects
Comments: 6 pages, 2 figures, 28th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2020). arXiv admin note: substantial text overlap with arXiv:1909.06175
Journal-ref: Proceedings of the 28th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (2020) 327-332
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Recurrent connectivity in the visual cortex is believed to aid object recognition for challenging conditions such as occlusion. Here we investigate if and how artificial neural networks also benefit from recurrence. We compare architectures composed of bottom-up, lateral and top-down connections and evaluate their performance using two novel stereoscopic occluded object datasets. We find that classification accuracy is significantly higher for recurrent models when compared to feedforward models of matched parametric complexity. Additionally we show that for challenging stimuli, the recurrent feedback is able to correctly revise the initial feedforward guess.

[37]  arXiv:2104.10642 [pdf, other]
Title: Temporal Modulation Network for Controllable Space-Time Video Super-Resolution
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Space-time video super-resolution (STVSR) aims to increase the spatial and temporal resolutions of low-resolution and low-frame-rate videos. Recently, deformable convolution based methods have achieved promising STVSR performance, but they can only infer the intermediate frame pre-defined in the training stage. Besides, these methods undervalue the short-term motion cues among adjacent frames. In this paper, we propose a Temporal Modulation Network (TMNet) to interpolate arbitrary intermediate frame(s) with accurate high-resolution reconstruction. Specifically, we propose a Temporal Modulation Block (TMB) to modulate deformable convolution kernels for controllable feature interpolation. To better exploit the temporal information, we propose a Locally-temporal Feature Comparison (LFC) module, along with the Bi-directional Deformable ConvLSTM, to extract short-term and long-term motion cues in videos. Experiments on three benchmark datasets demonstrate that our TMNet outperforms previous STVSR methods. The code is available at https://github.com/CS-GangXu/TMNet.

Cross-lists for Thu, 22 Apr 21

[38]  arXiv:2104.10195 (cross-list from eess.IV) [pdf, other]
Title: Auto-FedAvg: Learnable Federated Averaging for Multi-Institutional Medical Image Segmentation
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Federated learning (FL) enables collaborative model training while preserving each participant's privacy, which is particularly beneficial to the medical field. FedAvg is a standard algorithm that uses fixed weights, often derived from the dataset sizes at each client, to aggregate the distributed learned models on a server during the FL process. However, non-identical data distribution across clients, known as the non-i.i.d. problem in FL, could make this assumption for setting fixed aggregation weights sub-optimal. In this work, we design a new data-driven approach, namely Auto-FedAvg, where aggregation weights are dynamically adjusted, depending on data distributions across data silos and the current training progress of the models. We disentangle the parameter set into two parts, local model parameters and global aggregation parameters, and update them iteratively with a communication-efficient algorithm. We first show the validity of our approach by outperforming state-of-the-art FL methods for image recognition on a heterogeneous data split of CIFAR-10. Furthermore, we demonstrate our algorithm's effectiveness on two multi-institutional medical image analysis tasks, i.e., COVID-19 lesion segmentation in chest CT and pancreas segmentation in abdominal CT.
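
The fixed-weight aggregation that this abstract attributes to FedAvg can be sketched in a few lines. The function names and the flat list-of-floats parameter format are assumptions for illustration. How Auto-FedAvg learns its weights is not described in the abstract, so the sketch only shows where different weights would plug in.

```python
def size_based_weights(client_sizes):
    """FedAvg-style fixed weights: proportional to each client's dataset size."""
    total = sum(client_sizes)
    return [size / total for size in client_sizes]


def aggregate(client_params, weights):
    """Weighted average of per-client parameter vectors on the server."""
    dim = len(client_params[0])
    return [
        sum(w * params[j] for w, params in zip(weights, client_params))
        for j in range(dim)
    ]


# Two hypothetical clients holding 1 and 3 samples respectively.
client_params = [[1.0, 2.0], [3.0, 4.0]]
fixed = size_based_weights([1, 3])
global_params = aggregate(client_params, fixed)
```

Replacing `fixed` with any other weight vector that sums to one changes the aggregate without touching `aggregate` itself, which is the degree of freedom a learned-weight scheme would adjust.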

[39]  arXiv:2104.10268 (cross-list from eess.IV) [pdf, other]
Title: TWIST-GAN: Towards Wavelet Transform and Transferred GAN for Spatio-Temporal Single Image Super Resolution
Comments: Accepted: ACM TIST (10-03-2021)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Single Image Super-resolution (SISR) produces high-resolution images with fine spatial resolutions from a remotely sensed image with low spatial resolution. Recently, deep learning and generative adversarial networks (GANs) have made breakthroughs for the challenging task of single image super-resolution (SISR). However, the generated image still suffers from undesirable artifacts such as the absence of texture-feature representation and high-frequency information. We propose a frequency domain-based spatio-temporal remote sensing single image super-resolution technique to reconstruct the HR image combined with generative adversarial networks (GANs) on various frequency bands (TWIST-GAN). We have introduced a new method incorporating Wavelet Transform (WT) characteristics and a transferred generative adversarial network. The LR image is split into various frequency bands by using the WT, whereas the transfer generative adversarial network predicts high-frequency components via a proposed architecture. Finally, the inverse wavelet transform produces a reconstructed image with super-resolution. The model is first trained on an external DIV2K dataset and validated with the UC Merced Landsat remote sensing dataset and Set14 with each image size of 256x256. Following that, transferred GANs are used to process spatio-temporal remote sensing images in order to minimize computation cost differences and improve texture information. The findings are compared quantitatively and qualitatively with the current state-of-the-art approaches. In addition, we saved about 43% of the GPU memory during training and accelerated the execution of our simplified version by eliminating batch normalization layers.
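
The band splitting and inverse transform mentioned in this abstract can be illustrated with a one-level 2D Haar wavelet transform. The abstract does not say which wavelet the authors use, so the Haar choice, the band names, and the function names are assumptions. The sketch needs an image with even height and width.

```python
def haar2d(image):
    """One-level 2D Haar transform of a 2D list (even height and width).

    Returns four half-resolution bands: a low-frequency approximation and
    three detail bands (column differences, row differences, diagonal).
    """
    approx, col_diff, row_diff, diag = [], [], [], []
    for r in range(0, len(image), 2):
        bands = ([], [], [], [])
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            bands[0].append((a + b + d + e) / 2)
            bands[1].append((a - b + d - e) / 2)
            bands[2].append((a + b - d - e) / 2)
            bands[3].append((a - b - d + e) / 2)
        approx.append(bands[0])
        col_diff.append(bands[1])
        row_diff.append(bands[2])
        diag.append(bands[3])
    return approx, col_diff, row_diff, diag


def inverse_haar2d(approx, col_diff, row_diff, diag):
    """Rebuild the full-resolution image from the four Haar bands."""
    image = []
    for r in range(len(approx)):
        top, bottom = [], []
        for c in range(len(approx[0])):
            ll, cd = approx[r][c], col_diff[r][c]
            rd, dg = row_diff[r][c], diag[r][c]
            top += [(ll + cd + rd + dg) / 2, (ll - cd + rd - dg) / 2]
            bottom += [(ll + cd - rd - dg) / 2, (ll - cd - rd + dg) / 2]
        image += [top, bottom]
    return image


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
bands = haar2d(image)
restored = inverse_haar2d(*bands)
```

The transform is lossless, so `restored` equals `image`. In a super-resolution pipeline of the kind the abstract describes, a network would predict or refine the detail bands before the inverse step.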

[40]  arXiv:2104.10283 (cross-list from cs.CL) [pdf, other]
Title: GraghVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering
Comments: NAACL 2021 MAI-Workshop. Code available at this https URL
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Images are more than a collection of objects or attributes -- they represent a web of relationships among interconnected objects. Scene Graph has emerged as a new modality: a structured graphical representation of images. Scene Graph encodes objects as nodes connected via pairwise relations as edges. To support question answering on scene graphs, we propose GraphVQA, a language-guided graph neural network framework that translates and executes a natural language question as multiple iterations of message passing among graph nodes. We explore the design space of the GraphVQA framework and discuss the trade-offs of different design choices. Our experiments on the GQA dataset show that GraphVQA outperforms the state-of-the-art accuracy by a large margin (94.78% vs. 88.43%).

[41]  arXiv:2104.10299 (cross-list from cs.GR) [pdf, other]
Title: Voice2Mesh: Cross-Modal 3D Face Model Generation from Voices
Comments: Project page: this https URL
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

This work analyzes whether 3D face models can be learned from only the speech inputs of speakers. Previous works on cross-modal face synthesis study image generation from voices. However, image synthesis includes variations such as hairstyles, backgrounds, and facial textures that are arguably irrelevant to voice or lack direct studies showing correlations. We instead investigate the ability to reconstruct 3D faces, concentrating only on geometry, which is more physiologically grounded. We propose both supervised and unsupervised learning frameworks. In particular, we demonstrate how unsupervised learning is possible in the absence of a direct voice-to-3D-face dataset, under limited availability of 3D face scans, when the model is equipped with knowledge distillation. To evaluate the performance, we also propose several metrics to measure the geometric fitness of two 3D faces based on points, lines, and regions. Experimental results suggest that 3D face shapes can be reconstructed from voices and that our method improves performance over the baseline. The best performance gains (15% - 20%), on the ear-to-ear distance ratio metric (ER), coincide with the intuition that one can roughly envision whether a speaker's face is overall wider or thinner from the person's voice alone. See our project page for code and data.

[42]  arXiv:2104.10315 (cross-list from eess.IV) [pdf, ps, other]
Title: Visual Analysis Motivated Rate-Distortion Model for Image Coding
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Optimized for pixel fidelity metrics, images compressed by existing image codecs face systematic challenges when used for visual analysis tasks, especially under low-bitrate coding. This paper proposes a visual analysis-motivated rate-distortion model for Versatile Video Coding (VVC) intra compression. The proposed model has two major contributions: a novel rate allocation strategy and a new distortion measurement model. We first propose the region of interest for machine (ROIM) to evaluate the degree of importance of each coding tree unit (CTU) in visual analysis. Then, a novel CTU-level bit allocation model is proposed based on ROIM and the local texture characteristics of each CTU. After an in-depth analysis of multiple distortion models, a visual analysis friendly distortion criterion is subsequently proposed by extracting deep features of each coding unit (CU). To alleviate the lack of spatial context information when calculating the distortion of each CU, we finally propose a multi-scale feature distortion (MSFD) metric using different neighboring pixels by weighting the extracted deep features in each scale. Extensive experimental results show that the proposed scheme can achieve up to 28.17% bitrate saving under the same analysis performance across several typical visual analysis tasks such as image classification, object detection, and semantic segmentation.

[43]  arXiv:2104.10326 (cross-list from eess.IV) [pdf, other]
Title: A Structure-Aware Relation Network for Thoracic Diseases Detection and Segmentation
Comments: This paper has been accepted by IEEE Transactions on Medical Imaging
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Instance-level detection and segmentation of thoracic diseases or abnormalities are crucial for automatic diagnosis in chest X-ray images. Leveraging constant structure and disease relations extracted from domain knowledge, we propose a structure-aware relation network (SAR-Net) extending Mask R-CNN. The SAR-Net consists of three relation modules: (1) the anatomical structure relation module, encoding spatial relations between diseases and anatomical parts; (2) the contextual relation module, aggregating clues based on query-key pairs of disease RoI and lung fields; and (3) the disease relation module, propagating co-occurrence and causal relations into disease proposals. Towards making a practical system, we also provide ChestX-Det, a chest X-ray dataset with instance-level annotations (boxes and masks). ChestX-Det is a subset of the public dataset NIH ChestX-ray14. It contains ~3500 images of 13 common disease categories labeled by three board-certified radiologists. We evaluate our SAR-Net on it and on another dataset, DR-Private. Experimental results show that it can enhance the strong baseline of Mask R-CNN with significant improvements. ChestX-Det is released at https://github.com/Deepwise-AILab/ChestX-Det-Dataset.

[44]  arXiv:2104.10329 (cross-list from cs.LG) [pdf, ps, other]
Title: Deep Transform and Metric Learning Networks
Comments: Accepted by ICASSP 2021. arXiv admin note: substantial text overlap with arXiv:2002.07898
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Based on its great successes in inference and denoising tasks, Dictionary Learning (DL) and its related sparse optimization formulations have garnered a lot of research interest. While most solutions have focused on single-layer dictionaries, the recently improved Deep DL methods have also fallen short on a number of issues. We hence propose a novel Deep DL approach where each DL layer can be formulated and solved as a combination of one linear layer and a Recurrent Neural Network, where the RNN is flexibly regarded as a layer-associated learned metric. Our proposed work unveils new insights between Neural Networks and Deep DL, and provides a novel, efficient and competitive approach to jointly learn the deep transforms and metrics. Extensive experiments are carried out to demonstrate that the proposed method can outperform not only existing Deep DL, but also state-of-the-art generic Convolutional Neural Networks.

[45]  arXiv:2104.10348 (cross-list from math.OC) [pdf, other]
Title: Fixed-Point and Objective Convergence of Plug-and-Play Algorithms
Comments: Published in IEEE Transactions on Computational Imaging
Journal-ref: in IEEE Transactions on Computational Imaging, vol. 7, pp. 337-348, 2021
Subjects: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

A standard model for image reconstruction involves the minimization of a data-fidelity term along with a regularizer, where the optimization is performed using proximal algorithms such as ISTA and ADMM. In plug-and-play (PnP) regularization, the proximal operator (associated with the regularizer) in ISTA and ADMM is replaced by a powerful image denoiser. Although PnP regularization works surprisingly well in practice, its theoretical convergence -- whether convergence of the PnP iterates is guaranteed and if they minimize some objective function -- is not completely understood even for simple linear denoisers such as nonlocal means. In particular, while there are works where either iterate or objective convergence is established separately, a simultaneous guarantee on iterate and objective convergence is not available for any denoiser to our knowledge. In this paper, we establish both forms of convergence for a special class of linear denoisers. Notably, unlike existing works where the focus is on symmetric denoisers, our analysis covers non-symmetric denoisers such as nonlocal means and almost any convex data-fidelity. The novelty in this regard is that we make use of the convergence theory of averaged operators and we work with a special inner product (and norm) derived from the linear denoiser; the latter requires us to appropriately define the gradient and proximal operators associated with the data-fidelity term. We validate our convergence results using image reconstruction experiments.
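The PnP substitution described in this abstract can be sketched in a few lines: an ISTA-style loop takes a gradient step on the data-fidelity term and then applies a denoiser where the proximal operator would normally go. This is only a minimal illustration under stated assumptions, not the paper's algorithm or analysis. The least-squares data-fidelity term, the moving-average `linear_denoiser`, and all function and parameter names are hypothetical stand-ins, and the sketch uses the ordinary Euclidean gradient rather than the special inner product the abstract mentions.

```python
import numpy as np


def linear_denoiser(x, weight=0.5):
    """Hypothetical stand-in denoiser: blend x with a 3-tap moving average.

    The map is linear in x; it is NOT the nonlocal-means denoiser from the abstract.
    """
    padded = np.pad(x, 1, mode="edge")
    smoothed = (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0
    return (1.0 - weight) * x + weight * smoothed


def pnp_ista(A, b, denoiser, step, num_iters=200):
    """PnP-ISTA sketch for the assumed data-fidelity term 0.5 * ||A x - b||^2."""
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)       # gradient of the data-fidelity term
        x = denoiser(x - step * grad)  # denoiser replaces the proximal operator
    return x


# Usage on synthetic data: a noisy 1D step signal observed directly (A = identity).
rng = np.random.default_rng(0)
A = np.eye(8)
b = np.repeat([0.0, 1.0], 4) + 0.1 * rng.normal(size=8)
x_hat = pnp_ista(A, b, linear_denoiser, step=0.5)
```

In this toy setting each iteration is a contraction, so the iterates settle at a fixed point; whether and when that happens for general denoisers is the question the paper studies.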

[46]  arXiv:2104.10377 (cross-list from cs.LG) [pdf, other]
Title: Dual Head Adversarial Training
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Deep neural networks (DNNs) are known to be vulnerable to adversarial examples/attacks, raising concerns about their reliability in safety-critical applications. A number of defense methods have been proposed to train robust DNNs resistant to adversarial attacks, among which adversarial training has so far demonstrated the most promising results. However, recent studies have shown that there exists an inherent tradeoff between accuracy and robustness in adversarially-trained DNNs. In this paper, we propose a novel technique Dual Head Adversarial Training (DH-AT) to further improve the robustness of existing adversarial training methods. Different from existing improved variants of adversarial training, DH-AT modifies both the architecture of the network and the training strategy to seek more robustness. Specifically, DH-AT first attaches a second network head (or branch) to one intermediate layer of the network, then uses a lightweight convolutional neural network (CNN) to aggregate the outputs of the two heads. The training strategy is also adapted to reflect the relative importance of the two heads. We empirically show, on multiple benchmark datasets, that DH-AT can bring notable robustness improvements to existing adversarial training methods. Compared with TRADES, one state-of-the-art adversarial training method, our DH-AT can improve the robustness by 3.4% against PGD40 and 2.3% against AutoAttack, and also improve the clean accuracy by 1.8%.

[47]  arXiv:2104.10425 (cross-list from cs.LG) [pdf, other]
Title: Sparse-Shot Learning for Extremely Many Localisations
Comments: 14 pages, 7 figures, 5 tables
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Object localisation is typically considered in the context of regular images, for instance depicting objects like people or cars. In these images there is typically a relatively small number of instances per image per class, which usually is manageable to annotate. However, outside the realm of regular images we are often confronted with a different situation. In computational pathology, digitised tissue sections are extremely large images, whose dimensions quickly exceed 250'000x250'000 pixels, where relevant objects, such as tumour cells or lymphocytes, can quickly number in the millions. Annotating them all is practically impossible and annotating sparsely a few, out of many more, is the only possibility. Unfortunately, learning from sparse annotations, or sparse-shot learning, clashes with standard supervised learning because what is not annotated is treated as a negative. However, assigning negative labels to what are true positives leads to confusion in the gradients and biased learning. To this end, we present exclusive cross entropy, which slows down the biased learning by examining the second-order loss derivatives in order to drop the loss terms corresponding to likely biased terms. Experiments on nine datasets and two different localisation tasks, detection with YOLLO and segmentation with Unet, show that we obtain considerable improvements compared to cross entropy or focal loss, while often reaching the best possible performance for the model with only 10-40% of annotations.

[48]  arXiv:2104.10453 (cross-list from cs.LG) [pdf, other]
Title: Brittle Features May Help Anomaly Detection
Comments: Accepted to Women in Computer Vision workshop at CVPR (2021)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

One-class anomaly detection is challenging. A representation that clearly distinguishes anomalies from normal data is ideal, but arriving at this representation is difficult since only normal data is available at training time. We examine the performance of representations, transferred from auxiliary tasks, for anomaly detection. Our results suggest that the choice of representation is more important than the anomaly detector used with these representations, although knowledge distillation can work better than using the representations directly. In addition, separability between anomalies and normal data is important but not the sole factor for a good representation, as anomaly detection performance is also correlated with more adversarially brittle features in the representation space. Finally, we show our configuration can detect 96.4% of anomalies in a genuine X-ray security dataset, outperforming previous results.

[49]  arXiv:2104.10459 (cross-list from cs.LG) [pdf, ps, other]
Title: Jacobian Regularization for Mitigating Universal Adversarial Perturbations
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Universal Adversarial Perturbations (UAPs) are input perturbations that can fool a neural network on large sets of data. They are a class of attacks that represents a significant threat as they facilitate realistic, practical, and low-cost attacks on neural networks. In this work, we derive upper bounds for the effectiveness of UAPs based on norms of data-dependent Jacobians. We empirically verify that Jacobian regularization greatly increases model robustness to UAPs by up to four times whilst maintaining clean performance. Our theoretical analysis also allows us to formulate a metric for the strength of shared adversarial perturbations between pairs of inputs. We apply this metric to benchmark datasets and show that it is highly correlated with the actual observed robustness. This suggests that realistic and practical universal attacks can be reliably mitigated without sacrificing clean accuracy, which shows promise for the robustness of machine learning systems.

[50]  arXiv:2104.10461 (cross-list from cs.LG) [pdf, other]
Title: Improving the Accuracy of Early Exits in Multi-Exit Architectures via Curriculum Learning
Comments: Accepted by the 2021 International Joint Conference on Neural Networks (IJCNN 2021)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Deploying deep learning services for time-sensitive and resource-constrained settings such as IoT using edge computing systems is a challenging task that requires dynamic adjustment of inference time. Multi-exit architectures allow deep neural networks to terminate their execution early in order to adhere to tight deadlines at the cost of accuracy. To mitigate this cost, in this paper we introduce a novel method called Multi-Exit Curriculum Learning that utilizes curriculum learning, a training strategy for neural networks that imitates human learning by sorting the training samples based on their difficulty and gradually introducing them to the network. Experiments on CIFAR-10 and CIFAR-100 datasets and various configurations of multi-exit architectures show that our method consistently improves the accuracy of early exits compared to the standard training approach.
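The curriculum idea in this abstract (sort training samples by difficulty, then introduce them gradually) can be sketched as a schedule of growing sample subsets. This is a generic illustration, not the paper's Multi-Exit Curriculum Learning method: the per-sample `difficulty` scores, the equal-sized stage growth, and the function name `curriculum_stages` are all assumptions made for the example.

```python
from typing import List, Sequence


def curriculum_stages(difficulty: Sequence[float], num_stages: int) -> List[List[int]]:
    """Return sample indices for each stage, easiest first, growing cumulatively.

    `difficulty` holds hypothetical per-sample scores (lower means easier).
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    # Sort sample indices from easiest to hardest.
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    stages = []
    for stage in range(1, num_stages + 1):
        # Each stage exposes a larger prefix of the sorted samples.
        cutoff = max(1, (len(order) * stage) // num_stages)
        stages.append(order[:cutoff])
    return stages


# Usage: four samples, two stages; the last stage covers the whole training set.
stages = curriculum_stages([0.9, 0.1, 0.5, 0.3], num_stages=2)
print(stages)  # [[1, 3], [1, 3, 2, 0]]
```

A training loop would then run some number of epochs on each stage in turn; how the difficulty scores are obtained and how the stages interact with the individual exits is left to the paper.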

[51]  arXiv:2104.10488 (cross-list from eess.IV) [pdf, other]
Title: A Two-Stage Attentive Network for Single Image Super-Resolution
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Recently, deep convolutional neural networks (CNNs) have been widely explored in single image super-resolution (SISR) and have contributed to remarkable progress. However, most of the existing CNN-based SISR methods do not adequately explore contextual information in the feature extraction stage and pay little attention to the final high-resolution (HR) image reconstruction step, hence hindering the desired SR performance. To address the above two issues, in this paper, we propose a two-stage attentive network (TSAN) for accurate SISR in a coarse-to-fine manner. Specifically, we design a novel multi-context attentive block (MCAB) to make the network focus on more informative contextual features. Moreover, we present an essential refined attention block (RAB) which can explore useful cues in HR space for reconstructing fine-detailed HR images. Extensive evaluations on four benchmark datasets demonstrate the efficacy of our proposed TSAN in terms of quantitative metrics and visual effects. Code is available at https://github.com/Jee-King/TSAN.

[52]  arXiv:2104.10546 (cross-list from eess.IV) [pdf, other]
Title: Invertible Denoising Network: A Light Solution for Real Noise Removal
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Invertible networks have various benefits for image denoising since they are lightweight, information-lossless, and memory-saving during back-propagation. However, applying invertible models to remove noise is challenging because the input is noisy, and the reversed output is clean, following two different distributions. We propose an invertible denoising network, InvDN, to address this challenge. InvDN transforms the noisy input into a low-resolution clean image and a latent representation containing noise. To discard noise and restore the clean image, InvDN replaces the noisy latent representation with another one sampled from a prior distribution during reversion. The denoising performance of InvDN is better than all the existing competitive models, achieving a new state-of-the-art result for the SIDD dataset while enjoying less run time. Moreover, the size of InvDN is far smaller, only having 4.2% of the number of parameters compared to the most recently proposed DANet. Further, via manipulating the noisy latent representation, InvDN is also able to generate noise more similar to the original one. Our code is available at: https://github.com/Yang-Liu1082/InvDN.git.

[53]  arXiv:2104.10553 (cross-list from eess.IV) [pdf]
Title: Rethinking annotation granularity for overcoming deep shortcut learning: A retrospective study on chest radiographs
Comments: 22 pages of main text, 18 pages of supplementary tables
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Deep learning has demonstrated radiograph screening performances that are comparable or superior to radiologists. However, recent studies show that deep models for thoracic disease classification usually show degraded performance when applied to external data. Such phenomena can be categorized into shortcut learning, where the deep models learn unintended decision rules that can fit the identically distributed training and test set but fail to generalize to other distributions. A natural way to alleviate this defect is explicitly indicating the lesions and focusing the model on learning the intended features. In this paper, we conduct extensive retrospective experiments to compare a popular thoracic disease classification model, CheXNet, and a thoracic lesion detection model, CheXDet. We first showed that the two models achieved similar image-level classification performance on the internal test set with no significant differences under many scenarios. Meanwhile, we found incorporating external training data even led to performance degradation for CheXNet. Then, we compared the models' internal performance on the lesion localization task and showed that CheXDet achieved significantly better performance than CheXNet even when given 80% less training data. By further visualizing the models' decision-making regions, we revealed that CheXNet learned patterns other than the target lesions, demonstrating its shortcut learning defect. Moreover, CheXDet achieved significantly better external performance than CheXNet on both the image-level classification task and the lesion localization task. Our findings suggest improving annotation granularity for training deep learning systems as a promising way to elevate future deep learning-based diagnosis systems for clinical usage.

[54]  arXiv:2104.10558 (cross-list from cs.RO) [pdf, other]
Title: Contingencies from Observations: Tractable Contingency Planning with Learned Behavior Models
Comments: To be published at ICRA 2021. Project page: this https URL
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Humans have a remarkable ability to make decisions by accurately reasoning about future events, including the future behaviors and states of mind of other agents. Consider driving a car through a busy intersection: it is necessary to reason about the physics of the vehicle, the intentions of other drivers, and their beliefs about your own intentions. If you signal a turn, another driver might yield to you, or if you enter the passing lane, another driver might decelerate to give you room to merge in front. Competent drivers must plan how they can safely react to a variety of potential future behaviors of other agents before they make their next move. This requires contingency planning: explicitly planning a set of conditional actions that depend on the stochastic outcome of future events. In this work, we develop a general-purpose contingency planner that is learned end-to-end using high-dimensional scene observations and low-dimensional behavioral observations. We use a conditional autoregressive flow model to create a compact contingency planning space, and show how this model can tractably learn contingencies from behavioral observations. We developed a closed-loop control benchmark of realistic multi-agent scenarios in a driving simulator (CARLA), on which we compare our method to various noncontingent methods that reason about multi-agent future behavior, including several state-of-the-art deep learning-based planning approaches. We illustrate that these noncontingent planning methods fundamentally fail on this benchmark, and find that our deep contingency planning method achieves significantly superior performance. Code to run our benchmark and reproduce our results is available at https://sites.google.com/view/contingency-planning

[55]  arXiv:2104.10596 (cross-list from eess.IV) [pdf]
Title: Using CNNs for AD classification based on spatial correlation of BOLD signals during the observation
Comments: 11 pages
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Resting state functional magnetic resonance images (fMRI) are commonly used for classification of patients as having Alzheimer's disease (AD), mild cognitive impairment (MCI), or being cognitive normal (CN). Most methods use time-series correlation of voxel signals during the observation period as a basis for the classification. In this paper we show that Convolutional Neural Network (CNN) classification based on spatial correlation of time-averaged signals yields a classification accuracy of up to 82% (sensitivity 86%, specificity 80%) for a data set with 429 subjects (246 cognitive normal and 183 Alzheimer patients). For the spatial correlation of time-averaged signal values we use voxel subdomains around center points of the 90-region AAL atlas. We form the subdomains as sets of voxels along a Hilbert curve of a bounding box in which the brain is embedded, with the AAL region center points serving as subdomain seeds. The spatial correlation of the 90 arrays formed by the subdomain segments of the Hilbert curve yields a symmetric 90x90 matrix that is used for the classification based on two different CNNs: a 4-layer CNN with 3x3 filters and with 4, 8, 16, and 32 output channels respectively, and a 2-layer CNN with 3x3 filters and with 4 and 8 output channels respectively. The results of the two networks are reported and compared.
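The step from 90 subdomain arrays to a symmetric 90x90 input matrix can be illustrated with a correlation matrix over the rows of a 2D array. This is a minimal sketch under assumptions not stated in the abstract: every segment is taken to have the same length, Pearson correlation is used as the correlation measure, and the data here are synthetic random values rather than time-averaged fMRI signals. The function name and the segment length of 64 are hypothetical.

```python
import numpy as np


def spatial_correlation_matrix(segments: np.ndarray) -> np.ndarray:
    """Pearson correlation between rows; `segments` has shape (regions, voxels)."""
    return np.corrcoef(segments)


# Synthetic stand-in: 90 regions, each with an assumed segment of 64 voxel values.
rng = np.random.default_rng(0)
segments = rng.normal(size=(90, 64))
corr = spatial_correlation_matrix(segments)
print(corr.shape)  # (90, 90)
```

The resulting matrix is symmetric with a unit diagonal, so it can be treated as a single-channel 90x90 image for a CNN classifier.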

[56]  arXiv:2104.10603 (cross-list from eess.IV) [pdf, other]
Title: GAN-Based Data Augmentation and Anonymization for Skin-Lesion Analysis: A Critical Review
Comments: Accepted to the ISIC Skin Image Analysis Workshop @ CVPR 2021
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Despite the growing availability of high-quality public datasets, the lack of training samples is still one of the main challenges of deep learning for skin lesion analysis. Generative Adversarial Networks (GANs) appear as an enticing alternative to alleviate the issue, by synthesizing samples indistinguishable from real images, with a plethora of works employing them for medical applications. Nevertheless, carefully designed experiments for skin-lesion diagnosis with GAN-based data augmentation show favorable results only on out-of-distribution test sets. For GAN-based data anonymization, where the synthetic images replace the real ones, favorable results also only appear for out-of-distribution test sets. Because of the costs and risks associated with GAN usage, those results suggest caution in their adoption for medical applications.

[57]  arXiv:2104.10611 (cross-list from eess.IV) [pdf, other]
Title: Programmable 3D snapshot microscopy with Fourier convolutional networks
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

3D snapshot microscopy enables volumetric imaging as fast as a camera allows by capturing a 3D volume in a single 2D camera image, and has found a variety of biological applications such as whole brain imaging of fast neural activity in larval zebrafish. The optimal microscope design for this optical 3D-to-2D encoding to preserve as much 3D information as possible is generally unknown and sample-dependent. Highly-programmable optical elements create new possibilities for sample-specific computational optimization of microscope parameters, e.g. tuning the collection of light for a given sample structure, especially using deep learning. This involves a differentiable simulation of light propagation through the programmable microscope and a neural network to reconstruct volumes from the microscope image. We introduce a class of global kernel Fourier convolutional neural networks which can efficiently integrate the globally mixed information encoded in a 3D snapshot image. We show in silico that our proposed global Fourier convolutional networks succeed in large field-of-view volume reconstruction and microscope parameter optimization where traditional networks fail.

[58]  arXiv:2104.10622 (cross-list from cs.GR) [pdf, other]
Title: Voxel Structure-based Mesh Reconstruction from a 3D Point Cloud
Comments: 15 pages, 28 figures, journal paper accepted by IEEE Transactions on Multimedia
Subjects: Graphics (cs.GR); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)

Mesh reconstruction from a 3D point cloud is an important topic in the fields of computer graphics, computer vision, and multimedia analysis. In this paper, we propose a voxel structure-based mesh reconstruction framework. It provides the intrinsic metric to improve the accuracy of local region detection. Based on the detected local regions, an initial reconstructed mesh can be obtained. With the mesh optimization in our framework, the initial reconstructed mesh is optimized into an isotropic one that keeps the important geometric features such as external and internal edges. The experimental results indicate that our framework shows great advantages over peer ones in terms of mesh quality, geometric feature keeping, and processing speed.

[59]  arXiv:2104.10631 (cross-list from cs.LG) [pdf, other]
Title: MetricOpt: Learning to Optimize Black-Box Evaluation Metrics
Comments: CVPR 2021 (Oral), Supplementary Materials added
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

We study the problem of directly optimizing arbitrary non-differentiable task evaluation metrics such as misclassification rate and recall. Our method, named MetricOpt, operates in a black-box setting where the computational details of the target metric are unknown. We achieve this by learning a differentiable value function, which maps compact task-specific model parameters to metric observations. The learned value function is easily pluggable into existing optimizers like SGD and Adam, and is effective for rapidly finetuning a pre-trained model. This leads to consistent improvements since the value function provides effective metric supervision during finetuning, and helps to correct the potential bias of loss-only supervision. MetricOpt achieves state-of-the-art performance on a variety of metrics for (image) classification, image retrieval and object detection. Solid benefits are found over competing methods, which often involve complex loss design or adaptation. MetricOpt also generalizes well to new tasks and model architectures.

Replacements for Thu, 22 Apr 21

[60]  arXiv:2001.08714 (replaced) [pdf, other]
Title: Ternary Feature Masks: zero-forgetting for task-incremental learning
Comments: To appear in the IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPR-W) on Continual Learning in Computer Vision (CLVision) 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[61]  arXiv:2006.07976 (replaced) [pdf, other]
Title: Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization
Comments: Accepted in CVPR 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[62]  arXiv:2008.00951 (replaced) [pdf, other]
Title: Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation
Comments: Accepted to CVPR 2021, project page available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[63]  arXiv:2008.01576 (replaced) [pdf, other]
Title: Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions
Comments: ECCV 2020. Introduction video at this https URL and code at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[64]  arXiv:2008.03685 (replaced) [pdf, ps, other]
Title: Semantic scene synthesis: Application to assistive systems
Comments: paper Not published
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[65]  arXiv:2010.07614 (replaced) [pdf, other]
Title: THIN: THrowable Information Networks and Application for Facial Expression Recognition In The Wild
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[66]  arXiv:2010.09125 (replaced) [pdf, other]
Title: Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering
Comments: Accepted to ICLR 2021 as an Oral paper
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[67]  arXiv:2011.11201 (replaced) [pdf, other]
Title: Concept Grounding with Modular Action-Capsules in Semantic Video Prediction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[68]  arXiv:2011.13084 (replaced) [pdf, other]
Title: Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes
Comments: CVPR 2021, Project Website: this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[69]  arXiv:2012.01558 (replaced) [pdf, other]
Title: From a Fourier-Domain Perspective on Adversarial Examples to a Wiener Filter Defense for Semantic Segmentation
Comments: Accepted by The International Joint Conference on Neural Network (IJCNN) 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[70]  arXiv:2012.02493 (replaced) [pdf, other]
Title: Compositionally Generalizable 3D Structure Prediction
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[71]  arXiv:2012.07287 (replaced) [pdf, other]
Title: Information-Theoretic Segmentation by Inpainting Error Maximization
Comments: Published as a conference paper at CVPR 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[72]  arXiv:2012.12741 (replaced) [pdf, other]
Title: Exploring Data Augmentation for Multi-Modality 3D Object Detection
Comments: Technical Report
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[73]  arXiv:2102.00221 (replaced) [pdf, other]
Title: ObjectAug: Object-level Data Augmentation for Semantic Image Segmentation
Comments: 8 pages, 7 figures, 9 tables, Accepted by IJCNN2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[74]  arXiv:2102.04604 (replaced) [pdf, other]
Title: SwiftNet: Real-time Video Object Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[75]  arXiv:2102.12308 (replaced) [pdf, other]
Title: "Train one, Classify one, Teach one" -- Cross-surgery transfer learning for surgical step recognition
Comments: MIDL 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[76]  arXiv:2103.00334 (replaced) [pdf, other]
Title: BiconNet: An Edge-preserved Connectivity-based Approach for Salient Object Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
[77]  arXiv:2103.03821 (replaced) [pdf, other]
Title: Fast Interactive Video Object Segmentation with Graph Neural Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[78]  arXiv:2103.10681 (replaced) [pdf, other]
Title: Learning the Superpixel in a Non-iterative and Lifelong Manner
Comments: Accepted by CVPR 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[79]  arXiv:2103.15017 (replaced) [pdf, other]
Title: H-GAN: the power of GANs in your Hands
Comments: Paper accepted at The International Joint Conference on Neural Networks (IJCNN) 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[80]  arXiv:2104.00953 (replaced) [pdf, other]
Title: Learning Transferable Kinematic Dictionary for 3D Human Pose and Shape Reconstruction
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[81]  arXiv:2104.01086 (replaced) [pdf, other]
Title: Defending Against Image Corruptions Through Adversarial Augmentations
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[82]  arXiv:2104.02633 (replaced) [pdf, other]
Title: Latent Space Regularization for Unsupervised Domain Adaptation in Semantic Segmentation
Comments: 11 pages, 7 figures, 1 table
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[83]  arXiv:2104.05160 (replaced) [pdf, other]
Title: Feature Decomposition and Reconstruction Learning for Effective Facial Expression Recognition
Comments: accepted to CVPR 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[84]  arXiv:2104.05261 (replaced) [pdf, other]
Title: Robust Classification from Noisy Labels: Integrating Additional Knowledge for Chest Radiography Abnormality Assessment
Comments: Accepted in Medical Image Analysis (MedIA). arXiv admin note: text overlap with arXiv:1905.06362
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[85]  arXiv:2104.08045 (replaced) [pdf, other]
Title: TeLCoS: OnDevice Text Localization with Clustering of Script
Comments: Accepted for publication in IJCNN 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[86]  arXiv:2104.08052 (replaced) [pdf, other]
Title: ScreenSeg: On-Device Screenshot Layout Analysis
Comments: Accepted for publication in IJCNN 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[87]  arXiv:2104.09133 (replaced) [pdf, other]
Title: RANSIC: Fast and Highly Robust Estimation for Rotation Search and Point Cloud Registration using Invariant Compatibility
Authors: Lei Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[88]  arXiv:2104.09760 (replaced) [pdf, other]
Title: HMS: Hierarchical Modality Selection for Efficient Video Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[89]  arXiv:2104.09770 (replaced) [pdf, other]
Title: M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[90]  arXiv:2104.09874 (replaced) [pdf, other]
Title: Boosting Masked Face Recognition with Multi-Task ArcFace
Comments: 6 pages, 4 figures. The paper is under consideration at Pattern Recognition Letters
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[91]  arXiv:2104.09958 (replaced) [pdf, other]
Title: GENESIS-V2: Inferring Unordered Object Representations without Iterative Refinement
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[92]  arXiv:2008.04024 (replaced) [pdf, other]
Title: An Explainable 3D Residual Self-Attention Deep Neural Network for Joint Atrophy Localization and Alzheimer's Disease Diagnosis using Structural MRI
Comments: IEEE Journal of Biomedical and Health Informatics (2021)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[93]  arXiv:2011.13786 (replaced) [pdf, other]
Title: Navigating the GAN Parameter Space for Semantic Image Editing
Comments: Supplementary code: this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[94]  arXiv:2012.07386 (replaced) [pdf, other]
Title: Phase Retrieval with Holography and Untrained Priors: Tackling the Challenges of Low-Photon Nanoscale Imaging
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics); Machine Learning (stat.ML)
[95]  arXiv:2101.06255 (replaced) [pdf, ps, other]
Title: Harmonization and the Worst Scanner Syndrome
Comments: Med-NeurIPS 2020 Workshop Paper, updated 4/2021
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
[96]  arXiv:2101.10044 (replaced) [pdf, other]
Title: Cross-lingual Visual Pre-training for Multimodal Machine Translation
Comments: Accepted to EACL 2021 (Camera-ready version)
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
[97]  arXiv:2102.08556 (replaced) [pdf, other]
Title: Deep cross-modality (MR-CT) educed distillation learning for cone beam CT lung tumor segmentation
Comments: The paper has been accepted to Medical Physics
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[98]  arXiv:2104.03002 (replaced) [pdf, other]
Title: CNN Based Segmentation of Infarcted Regions in Acute Cerebral Stroke Patients From Computed Tomography Perfusion Imaging
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[99]  arXiv:2104.07083 (replaced) [pdf]
Title: SVS-net: A Novel Semantic Segmentation Network in Optical Coherence Tomography Angiography Images
Comments: 6 pages, 6 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[100]  arXiv:2104.07667 (replaced) [pdf]
Title: Shoulder Implant X-Ray Manufacturer Classification: Exploring with Vision Transformer
Comments: 11 pages, 12 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[101]  arXiv:2104.09648 (replaced) [pdf, other]
Title: Memory Efficient 3D U-Net with Reversible Mobile Inverted Bottlenecks for Brain Tumor Segmentation
Comments: 11 pages, 5 figures, published at MICCAI BrainLes 2020
Journal-ref: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries (2021) 388-397
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
