Computer Vision and Pattern Recognition

New submissions

New submissions for Thu, 29 Jul 21

[1]  arXiv:2107.13029 [pdf, other]
Title: A New Split for Evaluating True Zero-Shot Action Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Zero-shot action recognition is the task of classifying action categories that are not available in the training set. In this setting, the standard evaluation protocol is to use existing action recognition datasets (e.g. UCF101) and randomly split the classes into seen and unseen. However, most recent work builds on representations pre-trained on the Kinetics dataset, whose classes largely overlap with classes in the zero-shot evaluation datasets. As a result, classes that are supposed to be unseen are present during supervised pre-training, invalidating the condition of the zero-shot setting. A similar concern was noted several years ago for image-based zero-shot recognition, but has not been considered by the zero-shot action recognition community. In this paper, we propose a new split for true zero-shot action recognition with no overlap between unseen test classes and training or pre-training classes. We benchmark several recent approaches on the proposed True Zero-Shot (TruZe) Split for UCF101 and HMDB51, with zero-shot and generalized zero-shot evaluation. In our extensive analysis, we find that our TruZe splits are significantly harder than comparable random splits because nothing leaks from pre-training, i.e. unseen performance is consistently lower, by up to 9.4% for zero-shot action recognition. In an additional evaluation, we also find that similar issues exist in the splits used in few-shot action recognition, where we see differences of up to 14.1%. We publish our splits and hope that our benchmark analysis will change how the field evaluates zero- and few-shot action recognition moving forward.
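
The central constraint of such a split, that no unseen test class also appears among the pre-training classes, can be sketched as follows. This is a minimal illustration and not the authors' procedure: the function name `true_zero_shot_split`, the case-insensitive exact-name matching rule, and the toy class lists are all assumptions made here, and the published splits may rely on semantic or visual matching rather than string equality.

```python
import random


def true_zero_shot_split(dataset_classes, pretraining_classes, num_unseen, seed=0):
    """Split classes into (seen, unseen) so that no unseen class
    appears among the pre-training classes.

    Overlap is detected by case-insensitive exact name match, which is
    an assumption made for this sketch only.
    """
    pretrain = {name.lower() for name in pretraining_classes}
    # Only classes that never occurred during pre-training are valid
    # candidates for the unseen (test) set.
    candidates = [c for c in dataset_classes if c.lower() not in pretrain]
    if num_unseen > len(candidates):
        raise ValueError("not enough non-overlapping classes for the unseen set")
    rng = random.Random(seed)
    unseen = sorted(rng.sample(candidates, num_unseen))
    unseen_set = set(unseen)
    # Everything else, including all overlapping classes, stays in the seen set.
    seen = [c for c in dataset_classes if c not in unseen_set]
    return seen, unseen


# Toy class lists (hypothetical, for illustration only).
dataset = ["archery", "juggling", "yo-yo", "fencing", "knitting"]
pretraining = ["Archery", "Fencing"]
seen, unseen = true_zero_shot_split(dataset, pretraining, num_unseen=2)
```

A random split, by contrast, would sample the unseen set from all of `dataset`, so classes such as "archery" could end up being tested as unseen even though the backbone was supervised on them.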

[2]  arXiv:2107.13046 [pdf, other]
Title: MixFaceNets: Extremely Efficient Face Recognition Networks
Comments: Accepted at International Joint Conference on Biometrics (IJCB 2021)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we present a set of extremely efficient and high-throughput models for accurate face verification, MixFaceNets, which are inspired by Mixed Depthwise Convolutional Kernels. Extensive experimental evaluations on the Labeled Faces in the Wild (LFW), AgeDB, MegaFace, and IARPA Janus Benchmarks IJB-B and IJB-C datasets have shown the effectiveness of our MixFaceNets for applications requiring extremely low computational complexity. Under the same level of computational complexity (< 500M FLOPs), our MixFaceNets outperform MobileFaceNets on all the evaluated datasets, achieving 99.60% accuracy on LFW, 97.05% accuracy on AgeDB-30, 93.60 TAR (at FAR1e-6) on MegaFace, 90.94 TAR (at FAR1e-4) on IJB-B and 93.08 TAR (at FAR1e-4) on IJB-C. With computational complexity between 500M and 1G FLOPs, our MixFaceNets achieve results comparable to the top-ranked models while using significantly fewer FLOPs and less computation overhead, which demonstrates the practical value of our proposed MixFaceNets. All training code, pre-trained models, and training logs have been made available at https://github.com/fdbtrs/mixfacenets.
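
For readers unfamiliar with the "TAR at FAR" figures quoted above, the metric can be sketched as follows: a decision threshold is chosen from the impostor scores so that the false accept rate does not exceed the target, and the true accept rate is then measured on genuine pairs at that threshold. The function name `tar_at_far`, the strict "score > threshold" acceptance rule, and the toy scores are assumptions for illustration; benchmark protocols define their own pair lists and tie handling.

```python
def tar_at_far(genuine_scores, impostor_scores, target_far):
    """Return (TAR, threshold) at a target false accept rate.

    A pair is accepted when its similarity score is strictly greater
    than the threshold (an assumption of this sketch).
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(impostor_scores, reverse=True)
    # Largest number of impostor pairs we may accept without exceeding target_far.
    allowed = int(target_far * len(ranked))
    if allowed >= len(ranked):
        threshold = float("-inf")  # every pair may be accepted
    else:
        # At most `allowed` impostor scores are strictly above this value.
        threshold = ranked[allowed]
    accepted = sum(1 for s in genuine_scores if s > threshold)
    return accepted / len(genuine_scores), threshold


# Toy similarity scores (hypothetical).
genuine = [0.9, 0.8, 0.4]
impostor = [0.5, 0.3, 0.2, 0.1]
tar_loose, thr_loose = tar_at_far(genuine, impostor, target_far=0.25)
tar_strict, thr_strict = tar_at_far(genuine, impostor, target_far=0.0)
```

With very small targets such as 1e-6, the number of impostor pairs must be large enough for `allowed` to be meaningful, which is why these operating points are reported on large-scale benchmarks.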

[3]  arXiv:2107.13083 [pdf, other]
Title: Is Object Detection Necessary for Human-Object Interaction Recognition?
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

This paper revisits human-object interaction (HOI) recognition at image level without using supervision of object location and human pose. We name it detection-free HOI recognition, in contrast to the existing detection-supervised approaches, which rely on object and keypoint detections to achieve state of the art. With our method, not only is detection supervision avoidable, but superior performance can also be achieved by properly using image-text pre-training (such as CLIP) and the proposed Log-Sum-Exp Sign (LSE-Sign) loss function. Specifically, using text embeddings of class labels to initialize the linear classifier is essential for leveraging the CLIP pre-trained image encoder. In addition, LSE-Sign loss facilitates learning from multiple labels on an imbalanced dataset by normalizing gradients over all classes in a softmax format. Surprisingly, our detection-free solution achieves 60.5 mAP on the HICO dataset, outperforming the detection-supervised state of the art by 13.4 mAP.

[4]  arXiv:2107.13087 [pdf, other]
Title: DCL: Differential Contrastive Learning for Geometry-Aware Depth Synthesis
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We describe a method for realistic depth synthesis that learns diverse variations from real depth scans and ensures geometric consistency for effective synthetic-to-real transfer. Unlike general image synthesis pipelines, where geometries are mostly ignored, we treat geometries carried by the depth based on their own existence. We propose differential contrastive learning that explicitly enforces the underlying geometric properties to be invariant with respect to the real variations being learned. The resulting depth synthesis method is task-agnostic and can be used for training any task-specific networks with synthetic labels. We demonstrate the effectiveness of the proposed method by extensive evaluations on downstream real-world geometric reasoning tasks. We show that our method achieves better synthetic-to-real transfer performance than other state-of-the-art methods. When fine-tuned on a small number of real-world annotations, our method can even surpass the fully supervised baselines.

[5]  arXiv:2107.13093 [pdf, other]
Title: Automated Human Cell Classification in Sparse Datasets using Few-Shot Learning
Comments: 9 pages, 2 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Classifying and analyzing human cells is a lengthy procedure, often involving a trained professional. In an attempt to expedite this process, an active area of research involves automating cell classification through the use of deep learning-based techniques. In practice, a large amount of data is required to accurately train these deep learning models. However, due to the sparse human cell datasets currently available, the performance of these models is typically low. This study investigates the feasibility of using few-shot learning-based techniques to mitigate the data requirements for accurate training. The study comprises three parts. First, current state-of-the-art few-shot learning techniques are evaluated on human cell classification. The selected techniques are trained on a non-medical dataset and then tested on two out-of-domain human cell datasets. The results indicate that, overall, the test accuracy of state-of-the-art techniques decreased by at least 30% when transitioning from a non-medical dataset to a medical dataset. Second, this study evaluates the potential benefits, if any, of varying the backbone architecture and training schemes in current state-of-the-art few-shot learning techniques when used in human cell classification. Even with these variations, the overall test accuracy decreased from 88.66% on non-medical datasets to 44.13% at best on the medical datasets. Third, this study presents future directions for using few-shot learning in human cell classification. In general, few-shot learning in its current state performs poorly on human cell classification. The study proves that attempts to modify existing network architectures are not effective and concludes that future research effort should be focused on improving robustness towards out-of-domain testing using optimization-based or self-supervised few-shot learning techniques.

[6]  arXiv:2107.13098 [pdf, other]
Title: A Tale Of Two Long Tails
Comments: Preliminary results accepted to Workshop on Uncertainty and Robustness in Deep Learning (UDL), ICML, 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

As machine learning models are increasingly employed to assist human decision-makers, it becomes critical to communicate the uncertainty associated with these model predictions. However, the majority of work on uncertainty has focused on traditional probabilistic or ranking approaches - where the model assigns low probabilities or scores to uncertain examples. While this captures which examples are challenging for the model, it does not capture the underlying source of the uncertainty. In this work, we seek to identify examples the model is uncertain about and characterize the source of said uncertainty. We explore the benefits of designing a targeted intervention - targeted data augmentation of the examples where the model is uncertain over the course of training. We investigate whether the rate of learning in the presence of additional information differs between atypical and noisy examples. Our results show that this is indeed the case, suggesting that well-designed interventions over the course of training can be an effective way to characterize and distinguish between different sources of uncertainty.

[7]  arXiv:2107.13108 [pdf, other]
Title: PlaneTR: Structure-Guided Transformers for 3D Plane Recovery
Comments: ICCV 2021; Code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper presents a neural network built upon Transformers, namely PlaneTR, to simultaneously detect and reconstruct planes from a single image. Different from previous methods, PlaneTR jointly leverages the context information and the geometric structures in a sequence-to-sequence way to holistically detect plane instances in one forward pass. Specifically, we represent the geometric structures as line segments and construct the network with three main components: (i) context and line segment encoders, (ii) a structure-guided plane decoder, (iii) a pixel-wise plane embedding decoder. Given an image and its detected line segments, PlaneTR generates the context and line segment sequences via two specially designed encoders and then feeds them into a Transformer-based decoder to directly predict a sequence of plane instances by simultaneously considering the context and global structure cues. Finally, the pixel-wise embeddings are computed to assign each pixel to the predicted plane instance that is nearest to it in embedding space. Comprehensive experiments demonstrate that PlaneTR achieves state-of-the-art performance on the ScanNet and NYUv2 datasets.
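
The final assignment step described above, in which each pixel is given to the nearest predicted plane instance in embedding space, can be sketched as a nearest-neighbour lookup. This is a minimal pure-Python illustration under stated assumptions: the function name `assign_pixels_to_planes`, the use of squared Euclidean distance, and the 2-D toy embeddings are not taken from the paper.

```python
def assign_pixels_to_planes(pixel_embeddings, plane_embeddings):
    """For each pixel embedding, return the index of the nearest plane embedding.

    Distance is squared Euclidean (an assumption of this sketch).
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = []
    for pixel in pixel_embeddings:
        nearest = min(
            range(len(plane_embeddings)),
            key=lambda k: squared_distance(pixel, plane_embeddings[k]),
        )
        labels.append(nearest)
    return labels


# Toy embeddings (hypothetical): two plane instances, three pixels.
planes = [[0.0, 0.0], [1.0, 1.0]]
pixels = [[0.1, -0.1], [0.9, 1.2], [0.4, 0.4]]
labels = assign_pixels_to_planes(pixels, planes)
```

In a real pipeline the same lookup would run over a dense per-pixel embedding map on an accelerator; the list-based version here only shows the decision rule.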

[8]  arXiv:2107.13111 [pdf, other]
Title: Experimenting with Self-Supervision using Rotation Prediction for Image Captioning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Image captioning is a task in the field of Artificial Intelligence that merges computer vision and natural language processing. It is responsible for generating legends that describe images, and has various applications such as descriptions used by assistive technology or image indexing (for search engines, for instance). This makes it a crucial topic in AI that is undergoing a lot of research. This task, however, like many others, is trained on large image datasets labeled via human annotation, which can be very cumbersome: it requires manual effort, incurs both financial and temporal costs, is error-prone, and is potentially difficult to execute in some cases (e.g. medical images). To mitigate the need for labels, we attempt to use self-supervised learning, a type of learning where models use the data contained within the images themselves as labels. It is challenging to accomplish, though, since the task is two-fold: the images and captions come from two different modalities and are usually handled by different types of networks. It is thus not obvious what a completely self-supervised solution would look like. How it would achieve captioning in a way comparable to how self-supervision is applied today on image recognition tasks is still an ongoing research topic. In this project, we use an encoder-decoder architecture where the encoder is a convolutional neural network (CNN) trained on the OpenImages dataset that learns image features in a self-supervised fashion using the rotation pretext task. The decoder is a Long Short-Term Memory (LSTM) network; it is trained, along with the image captioning model, on the MS COCO dataset and is responsible for generating captions. Our GitHub repository can be found at https://github.com/elhagry1/SSL_ImageCaptioning_RotationPrediction
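
The rotation pretext task mentioned above derives its labels from the images themselves: each image is rotated by a multiple of 90 degrees, and the network is trained to predict which rotation was applied. The sketch below only shows how such (image, label) pairs could be generated; the function names, the clockwise convention, and the nested-list image format are assumptions, and the repository's actual data pipeline may differ.

```python
def rotate90_clockwise(image):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_pretext_samples(image):
    """Return four (rotated_image, label) pairs.

    Label k means the image was rotated clockwise by k * 90 degrees.
    """
    samples = []
    current = [list(row) for row in image]  # copy so the input is not modified
    for label in range(4):
        samples.append((current, label))
        current = rotate90_clockwise(current)
    return samples


# Toy 2x2 "image" (hypothetical pixel values).
image = [[1, 2],
         [3, 4]]
samples = rotation_pretext_samples(image)
```

No human annotation is involved: the label is a by-product of the transformation, which is what makes the encoder's training self-supervised.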

[9]  arXiv:2107.13114 [pdf, other]
Title: A Thorough Review on Recent Deep Learning Methodologies for Image Captioning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Image Captioning is a task that combines computer vision and natural language processing, aiming to generate descriptive legends for images. It is a two-fold process relying on accurate image understanding and correct language understanding, both syntactically and semantically. It is becoming increasingly difficult to keep up with the latest research and findings in the field of image captioning due to the growing amount of knowledge available on the topic. There is not, however, enough coverage of those findings in the available review papers. In this paper, we perform a run-through of the current techniques, datasets, benchmarks and evaluation metrics used in image captioning. The current research in the field is mostly focused on deep learning-based methods, where attention mechanisms along with deep reinforcement and adversarial learning appear to be at the forefront of this research topic. In this paper, we review recent methodologies such as UpDown, OSCAR, VIVO, Meta Learning and a model that uses conditional generative adversarial nets. Although the GAN-based model achieves the highest score, UpDown represents an important basis for image captioning, and OSCAR and VIVO are more useful as they use novel object captioning. This review paper serves as a roadmap for researchers to keep up to date with the latest contributions made in the field of image caption generation.

[10]  arXiv:2107.13117 [pdf, other]
Title: Image color correction, enhancement, and editing
Authors: Mahmoud Afifi
Comments: PhD dissertation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This thesis presents methods and approaches to image color correction, color enhancement, and color editing. To begin, we study the color correction problem from the standpoint of the camera's image signal processor (ISP). A camera's ISP is hardware that applies a series of in-camera image processing and color manipulation steps, many of which are nonlinear in nature, to render the initial sensor image to its final photo-finished representation saved in the 8-bit standard RGB (sRGB) color space. As white balance (WB) is one of the major procedures applied by the ISP for color correction, this thesis presents two different methods for ISP white balancing. Afterward, we discuss another scenario of correcting and editing image colors, where we present a set of methods to correct and edit WB settings for images that have been improperly white-balanced by the ISP. Then, we explore another factor that has a significant impact on the quality of camera-rendered colors, in which we outline two different methods to correct exposure errors in camera-rendered images. Lastly, we discuss post-capture auto color editing and manipulation. In particular, we propose auto image recoloring methods to generate different realistic versions of the same camera-rendered image with new colors. Through extensive evaluations, we demonstrate that our methods provide superior solutions compared to existing alternatives targeting color correction, color enhancement, and color editing.

[11]  arXiv:2107.13118 [pdf, other]
Title: Divide-and-Assemble: Learning Block-wise Memory for Unsupervised Anomaly Detection
Comments: accepted by ICCV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Reconstruction-based methods play an important role in unsupervised anomaly detection in images. Ideally, we expect a perfect reconstruction for normal samples and poor reconstruction for abnormal samples. Since the generalizability of deep neural networks is difficult to control, existing models such as autoencoders do not work well. In this work, we interpret the reconstruction of an image as a divide-and-assemble procedure. Surprisingly, by varying the granularity of division on feature maps, we are able to modulate the reconstruction capability of the model for both normal and abnormal samples. That is, finer granularity leads to better reconstruction, while coarser granularity leads to poorer reconstruction. With proper granularity, the gap between the reconstruction error of normal and abnormal samples can be maximized. The divide-and-assemble framework is implemented by embedding a novel multi-scale block-wise memory module into an autoencoder network. In addition, we introduce adversarial learning and explore the semantic latent representation of the discriminator, which improves the detection of subtle anomalies. We achieve state-of-the-art performance on the challenging MVTec AD dataset. Remarkably, we improve the vanilla autoencoder model by 10.1% in terms of the AUROC score.

[12]  arXiv:2107.13122 [pdf, other]
Title: Subjective evaluation of traditional and learning-based image coding methods
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

We conduct a subjective experiment to compare the performance of traditional image coding methods and learning-based image coding methods. HEVC and VVC, the state-of-the-art traditional coding methods, are used as the representative traditional methods. The learning-based methods used include not only CNN-based methods but also a GAN-based method, all of which are advanced or typical. Single Stimuli (SS), which is also called Absolute Category Rating (ACR), is adopted as the methodology of the experiment to obtain the perceptual quality of images. Additionally, we use some typical and frequently used objective quality metrics to evaluate the coding methods in the experiment for comparison. The experiment shows that CNN-based and GAN-based methods can perform better than traditional methods at low bit-rates. At high bit-rates, however, it is hard to verify whether CNN-based methods are superior to traditional methods. Because the GAN method does not provide models with high target bit-rates, we cannot exactly tell the performance of the GAN method at high bit-rates. Furthermore, some popular objective quality metrics have not shown a good ability to measure the quality of images generated by learning-based coding methods, especially the GAN-based one.

[13]  arXiv:2107.13137 [pdf, other]
Title: Unsupervised Monocular Depth Estimation in Highly Complex Environments
Comments: 11 pages, 8 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Previous unsupervised monocular depth estimation methods mainly focus on the day-time scenario, and their frameworks are driven by warped photometric consistency. However, in some challenging environments, such as night, rainy night or snowy winter, the photometry of the same pixel on different frames is inconsistent because of complex lighting and reflection, so the day-time unsupervised frameworks cannot be directly applied to these complex scenarios. In this paper, we investigate the problem of unsupervised monocular depth estimation in certain highly complex scenarios. We address this challenging problem by using domain adaptation, and propose a unified image transfer-based adaptation framework based on monocular videos. The depth model trained on day-time scenarios is adapted to different complex scenarios. Instead of adapting the whole depth network, we consider only the encoder network for lower computational complexity. The depth models adapted by the proposed framework to different scenarios share the same decoder, which is practical. Constraints on both feature space and output space encourage the framework to learn the key features for depth decoding, and the smoothness loss is introduced into the adaptation framework for better depth estimation performance. Extensive experiments show the effectiveness of the proposed unsupervised framework in estimating the dense depth map from night-time, rainy night-time and snowy winter images.

[14]  arXiv:2107.13144 [pdf, other]
Title: Content-aware Directed Propagation Network with Pixel Adaptive Kernel Attention
Comments: submitted to IEEE Transactions on Neural Networks and Learning Systems
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Convolutional neural networks (CNNs) have not only become widespread but have also achieved notable results on numerous applications including image classification, restoration, and generation. Although the weight-sharing property of convolutions makes them widely adopted in various tasks, their content-agnostic characteristic can also be considered a major drawback. To solve this problem, in this paper, we propose a novel operation, called pixel adaptive kernel attention (PAKA). PAKA provides directivity to the filter weights by multiplying spatially varying attention from learnable features. The proposed method infers pixel-adaptive attention maps along the channel and spatial directions separately to address the decomposed model with fewer parameters. Our method is trainable in an end-to-end manner and applicable to any CNN-based model. In addition, we propose an improved information aggregation module with PAKA, called the hierarchical PAKA module (HPM). We demonstrate the superiority of our HPM by presenting state-of-the-art performance on semantic segmentation compared to the conventional information aggregation modules. We validate the proposed method through additional ablation studies and by visualizing the effect of PAKA providing directivity to the weights of convolutions. We also show the generalizability of the proposed method by applying it to multi-modal tasks, especially color-guided depth map super-resolution.

[15]  arXiv:2107.13152 [pdf, other]
Title: Multi Point-Voxel Convolution (MPVConv) for Deep Learning on Point Clouds
Comments: arXiv admin note: substantial text overlap with arXiv:2104.14834
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The existing 3D deep learning methods adopt either individual point-based features or local-neighboring voxel-based features, and demonstrate great potential for processing 3D data. However, the point-based models are inefficient due to the unordered nature of point clouds and the voxel-based models suffer from large information loss. Motivated by the success of recent point-voxel representation, such as PVCNN, we propose a new convolutional neural network, called Multi Point-Voxel Convolution (MPVConv), for deep learning on point clouds. Integrating both the advantages of voxel and point-based methods, MPVConv can effectively increase the neighboring collection between point-based features and also promote independence among voxel-based features. Moreover, most of the existing approaches aim at solving one specific task, and only a few of them can handle a variety of tasks. Simply replacing the corresponding convolution module with MPVConv, we show that MPVConv can fit in different backbones to solve a wide range of 3D tasks. Extensive experiments on benchmark datasets such as ShapeNet Part, S3DIS and KITTI for various tasks show that MPVConv improves the accuracy of the backbone (PointNet) by up to 36%, and achieves higher accuracy than the voxel-based model with up to 34× speedups. In addition, MPVConv outperforms the state-of-the-art point-based models with up to 8× speedups. Notably, our MPVConv achieves better accuracy than the newest point-voxel-based model PVCNN (a model more efficient than PointNet) with lower latency.

[16]  arXiv:2107.13154 [pdf, other]
Title: Global Aggregation then Local Distribution for Scene Parsing
Comments: Accepted by IEEE-TIP-2021. arXiv admin note: text overlap with arXiv:1909.07229
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Modelling long-range contextual relationships is critical for pixel-wise prediction tasks such as semantic segmentation. However, convolutional neural networks (CNNs) are inherently limited in modelling such dependencies due to the naive structure of their building modules (e.g., local convolution kernel). While recent global aggregation methods are beneficial for long-range structure information modelling, they tend to oversmooth and bring noise to the regions containing fine details (e.g., boundaries and small objects), which matter greatly for the semantic segmentation task. To alleviate this problem, we propose to explore the local context so that the aggregated long-range relationship is distributed more accurately in local regions. In particular, we design a novel local distribution module which models the affinity map between global and local relationship for each pixel adaptively. Integrating existing global aggregation modules, we show that our approach can be modularized as an end-to-end trainable block and easily plugged into existing semantic segmentation networks, giving rise to the GALD networks. Despite its simplicity and versatility, our approach allows us to build new state of the art on major semantic segmentation benchmarks including Cityscapes, ADE20K, Pascal Context, Camvid and COCO-stuff. Code and trained models are released at https://github.com/lxtGH/GALD-DGCNet to foster further research.

[17]  arXiv:2107.13155 [pdf, other]
Title: Improving Video Instance Segmentation via Temporal Pyramid Routing
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence. Existing approaches are mainly based on single-frame features or single-scale features of multiple frames, where temporal information or multi-scale information is ignored. To incorporate both temporal and scale information, we propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames. Specifically, TPR contains two novel components, including Dynamic Aligned Cell Routing (DACR) and Cross Pyramid Routing (CPR), where DACR is designed for aligning and gating pyramid features across temporal dimension, while CPR transfers temporally aggregated features across scale dimension. Moreover, our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods. Extensive experiments on YouTube-VIS dataset demonstrate the effectiveness and efficiency of the proposed approach on several state-of-the-art instance segmentation methods. Codes and trained models will be publicly available to facilitate future research (https://github.com/lxtGH/TemporalPyramidRouting).

[18]  arXiv:2107.13156 [pdf, other]
Title: Shape Controllable Virtual Try-on for Underwear Models
Authors: Xin Gao (1), Zhenjiang Liu (1), Zunlei Feng (2), Chengji Shen (2), Kairi Ou (1), Haihong Tang (1), Mingli Song (2) ((1) Alibaba Group, (2) Zhejiang University)
Comments: 10 pages, 9 figures, conference
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The image virtual try-on task has abundant applications and has recently become a hot research topic. Existing 2D image-based virtual try-on methods aim to transfer a target clothing image onto a reference person, which has two main disadvantages: they cannot control the size and length precisely, and they are unable to accurately estimate the user's figure when users wear thick clothes, resulting in an inaccurate dressing effect. In this paper, we put forward a related task that aims to dress clothing for underwear models. To solve the above drawbacks, we propose a Shape Controllable Virtual Try-On Network (SC-VTON), where a graph attention network integrates the information of model and clothing to generate the warped clothing image. In addition, the control points are incorporated into SC-VTON for the desired clothing shape. Furthermore, by adding a Splitting Network and a Synthesis Network, we can use clothing/model pair data to help optimize the deformation module and generalize the task to the typical virtual try-on task. Extensive experiments show that the proposed method can achieve accurate shape control. Meanwhile, compared with other methods, our method can generate high-resolution results with detailed textures.

[19]  arXiv:2107.13167 [pdf, other]
Title: Unsupervised Segmentation for Terracotta Warrior with Seed-Region-Growing CNN(SRG-Net)
Comments: arXiv admin note: substantial text overlap with arXiv:2012.00433
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The repair of terracotta warriors in the Emperor Qinshihuang Mausoleum Site Museum is carried out by hand by experts, and the increasing number of unearthed pieces of terracotta warriors makes it too challenging for archaeologists to conduct the restoration efficiently. We hope to segment the 3D point cloud data of the terracotta warriors automatically and store the fragment data in a database to assist the archaeologists in matching the actual fragments with the ones in the database, which could result in higher repair efficiency. Moreover, existing 3D neural network research mainly focuses on supervised classification, clustering, unsupervised representation, and reconstruction. Few studies concentrate on unsupervised point cloud part segmentation. In this paper, we present SRG-Net for 3D point clouds of terracotta warriors to address these problems. Firstly, we adopt a customized seed-region-growing algorithm to segment the point cloud coarsely. Then we present supervised segmentation and unsupervised reconstruction networks to learn the characteristics of 3D point clouds. Finally, we combine the SRG algorithm with our improved CNN using a refinement method. This pipeline is called SRG-Net, which aims at conducting segmentation tasks on the terracotta warriors. Our proposed SRG-Net is evaluated on the terracotta warriors data and the ShapeNet dataset by measuring the accuracy and the latency. The experimental results show that our SRG-Net outperforms the state-of-the-art methods. Our code is shown in Code File 1 (Srgnet_2021).

[20]  arXiv:2107.13170 [pdf, other]
Title: Accurate Grid Keypoint Learning for Efficient Video Prediction
Comments: IROS2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Video prediction methods generally consume substantial computing resources in training and deployment, among which keypoint-based approaches show promising improvement in efficiency by simplifying dense image prediction to light keypoint prediction. However, keypoint locations are often modeled only as continuous coordinates, so noise from semantically insignificant deviations in videos easily disrupts learning stability, leading to inaccurate keypoint modeling. In this paper, we design a new grid keypoint learning framework, aiming at a robust and explainable intermediate keypoint representation for long-term efficient video prediction. We have two major technical contributions. First, we detect keypoints by jumping among candidate locations in our raised grid space and formulate a condensation loss to encourage meaningful keypoints with strong representative capability. Second, we introduce a 2D binary map to represent the detected grid keypoints and then suggest propagating keypoint locations with stochasticity by selecting entries in the discrete grid space, thus preserving the spatial structure of keypoints in the long-term horizon for better future frame generation. Extensive experiments verify that our method outperforms the state-of-the-art stochastic video prediction methods while saving more than 98% of computing resources. We also demonstrate our method on a robotic-assisted surgery dataset with promising results. Our code is available at https://github.com/xjgaocs/Grid-Keypoint-Learning.
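
The idea of replacing continuous keypoint coordinates with entries in a discrete 2D binary map can be sketched as follows. This is only an illustration of the representation, not the paper's detector or propagation model: the function names, the assumption that coordinates are normalized to [0, 1], and the choice of cell centres when mapping back are all hypothetical.

```python
def keypoints_to_binary_map(keypoints, grid_size):
    """Snap (x, y) keypoints, assumed normalized to [0, 1], onto a
    grid_size x grid_size binary map (1 = a keypoint occupies the cell)."""
    grid = [[0] * grid_size for _ in range(grid_size)]
    for x, y in keypoints:
        # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
        col = min(int(x * grid_size), grid_size - 1)
        row = min(int(y * grid_size), grid_size - 1)
        grid[row][col] = 1
    return grid


def binary_map_to_keypoints(grid):
    """Recover keypoints as the centres of the occupied cells."""
    size = len(grid)
    return [
        ((col + 0.5) / size, (row + 0.5) / size)
        for row, cells in enumerate(grid)
        for col, value in enumerate(cells)
        if value
    ]


# Toy keypoints (hypothetical) on a 4x4 grid.
grid = keypoints_to_binary_map([(0.1, 0.1), (0.9, 0.6)], grid_size=4)
centres = binary_map_to_keypoints(grid)
```

Small perturbations of a continuous coordinate leave the occupied cell unchanged, which illustrates why a discrete grid can be less sensitive to semantically insignificant deviations.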

[21]  arXiv:2107.13193 [pdf, other]
Title: Assessment of Deep Learning-based Heart Rate Estimation using Remote Photoplethysmography under Different Illuminations
Comments: 3 tables, 7 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Remote photoplethysmography (rPPG) monitors heart rate without requiring physical contact, which allows for a wide variety of applications. Deep learning-based rPPG methods have demonstrated superior performance over the traditional approaches in controlled contexts. However, the lighting situation in indoor space is typically complex, with uneven light distribution and frequent variations in illumination. A fair comparison of different methods under different illuminations using the same dataset is still lacking. In this paper, we present a public dataset, namely the BH-rPPG dataset, which contains data from twelve subjects under three illuminations: low, medium, and high illumination. We also provide the ground truth heart rate measured by an oximeter. We compare the performance of three deep learning-based methods to that of four traditional methods using two public datasets: the UBFC-rPPG dataset and the BH-rPPG dataset. The experimental results demonstrate that traditional methods are generally more resistant to fluctuating illuminations. We found that the rPPGNet achieves the lowest MAE among deep learning-based methods under medium illumination, whereas the CHROM achieves 1.5 beats per minute (BPM), outperforming the rPPGNet by 60%. These findings suggest that while developing deep learning-based heart rate estimation algorithms, illumination variation should be taken into account. This work serves as a benchmark for rPPG performance evaluation and it opens a pathway for future investigation into deep learning-based rPPG under illumination variations.

[22]  arXiv:2107.13217 [pdf, other]
Title: DeepTeeth: A Teeth-photo Based Human Authentication System for Mobile and Hand-held Devices
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

This paper proposes teeth-photo, a new biometric modality for human authentication on mobile and hand-held devices. Biometric samples are acquired using the camera mounted on a mobile device with the help of a mobile application having specific markers to register the teeth area. The region of interest (RoI) is then extracted using the markers and the obtained sample is enhanced using contrast limited adaptive histogram equalization (CLAHE) for better visual clarity. We propose a deep learning architecture and a novel regularization scheme to obtain a highly discriminative embedding for small-size RoIs. The proposed custom loss function was able to achieve perfect classification for the tiny RoI of $75\times 75$ size. The model is end-to-end and few-shot and therefore is very efficient in terms of time and energy requirements. The system can be used in many ways including device unlocking and secure authentication. To the best of our understanding, this is the first work on teeth-photo based authentication for mobile devices. Experiments have been conducted on an in-house teeth-photo database collected using our application. The database is made publicly available. Results have shown that the proposed system has perfect accuracy.

[23]  arXiv:2107.13221 [pdf, other]
Title: Normalization Matters in Weakly Supervised Object Localization
Comments: Accepted at ICCV 2021. 16 pages, 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Weakly-supervised object localization (WSOL) enables finding an object using a dataset without any localization information. By simply training a classification model using only image-level annotations, the feature map of the model can be utilized as a score map for localization. In spite of many WSOL methods proposing novel strategies, there has not been any de facto standard for how to normalize the class activation map (CAM). Consequently, many WSOL methods have failed to fully exploit their own capacity because of the misuse of a normalization method. In this paper, we review many existing normalization methods and point out that they should be used according to the property of the given dataset. Additionally, we propose a new normalization method which substantially enhances the performance of any CAM-based WSOL method. Using the proposed normalization method, we provide a comprehensive evaluation over three datasets (CUB, ImageNet and OpenImages) on three different architectures and observe significant performance gains over the conventional min-max normalization method in all the evaluated cases.
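
For reference, the conventional min-max normalization that the abstract uses as its baseline can be sketched as follows. This is only the baseline, not the normalization method the paper proposes, which the abstract does not describe. The function names and the thresholding step are illustrative assumptions about how a normalized CAM is typically turned into a localization mask.

```python
def min_max_normalize(cam):
    """Rescale a 2D score map (list of rows) linearly into [0, 1]."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no localization signal; return all zeros.
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / span for v in row] for row in cam]


def threshold_mask(score_map, tau):
    """Binarize a normalized score map (hypothetical post-processing step)."""
    return [[1 if v >= tau else 0 for v in row] for row in score_map]


normalized = min_max_normalize([[2.0, 4.0], [6.0, 10.0]])
mask = threshold_mask(normalized, tau=0.5)
```

Because the minimum and maximum are taken per map, a single extreme activation changes the scale of every other value, which is one reason the choice of normalization can affect the localization threshold.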

[24]  arXiv:2107.13233 [pdf, other]
Title: C^3Net: End-to-End deep learning for efficient real-time visual active camera control
Authors: Christos Kyrkou
Comments: Journal of Real-Time Image Processing, 2021. Real-time active vision, Smart camera, Deep learning, End-to-end learning this https URL&ab_channel=ChristosKyrkou. arXiv admin note: text overlap with arXiv:2012.06428
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

The need for automated real-time visual systems in applications such as smart camera surveillance, smart environments, and drones necessitates the improvement of methods for visual active monitoring and control. Traditionally, the active monitoring task has been handled through a pipeline of modules such as detection, filtering, and control. However, it is difficult to jointly optimize such methods and tune their various parameters for real-time processing in resource-constrained systems. In this paper a deep Convolutional Camera Controller Neural Network is proposed to go directly from visual information to camera movement to provide an efficient solution to the active vision problem. It is trained end-to-end without bounding box annotations to control a camera and follow multiple targets from raw pixel values. Evaluation through both a simulation framework and a real experimental setup indicates that the proposed solution is robust to varying conditions and able to achieve better monitoring performance than traditional approaches both in terms of number of targets monitored as well as in effective monitoring time. The advantage of the proposed approach is that it is computationally less demanding and can run at over 10 FPS (~4x speedup) on an embedded smart camera, providing a practical and affordable solution to real-time active monitoring.

[25]  arXiv:2107.13259 [pdf, other]
Title: TransAction: ICL-SJTU Submission to EPIC-Kitchens Action Anticipation Challenge 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this report, the technical details of our submission to the EPIC-Kitchens Action Anticipation Challenge 2021 are given. We developed a hierarchical attention model for action anticipation, which leverages a Transformer-based attention mechanism to aggregate features across the temporal dimension, modalities, and symbiotic branches, respectively. In terms of Mean Top-5 Recall of action, our submission with team name ICL-SJTU achieved 13.39% for the overall testing set, 10.05% for unseen subsets and 11.88% for tailed subsets. Additionally, it is noteworthy that our submission ranked 1st in terms of verb class in all three (sub)sets.
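
As a rough illustration of the reported metric, a class-mean top-k recall can be computed as sketched below. This assumes the common definition in which recall is computed per class and then averaged over the classes that appear in the ground truth; the function name and data layout are hypothetical, and the official challenge evaluation code may differ in details.

```python
def mean_top_k_recall(scores, labels, k=5):
    """Average, over classes present in `labels`, of per-class top-k recall.

    scores: list of per-sample score lists (one score per class).
    labels: list of ground-truth class indices, one per sample.
    """
    hits = {}    # class -> number of samples whose label is in the top k
    counts = {}  # class -> number of samples with that label
    for sample_scores, label in zip(scores, labels):
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        counts[label] = counts.get(label, 0) + 1
        if label in ranked[:k]:
            hits[label] = hits.get(label, 0) + 1
    recalls = [hits.get(c, 0) / n for c, n in counts.items()]
    return sum(recalls) / len(recalls)


example_scores = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.3, 0.6, 0.1]]
example_labels = [0, 0, 1]
result = mean_top_k_recall(example_scores, example_labels, k=1)
```

Averaging per class rather than per sample keeps frequent classes from dominating the score, which matters for long-tailed label distributions such as the tailed subsets mentioned above.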

[26]  arXiv:2107.13261 [pdf, other]
Title: Improving Multi-View Stereo via Super-Resolution
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Today, Multi-View Stereo techniques are able to reconstruct robust and detailed 3D models, especially when starting from high-resolution images. However, there are cases in which the resolution of input images is relatively low, for instance, when dealing with old photos, or when hardware constrains the amount of data that can be acquired. In this paper, we investigate if, how, and how much increasing the resolution of such input images through Super-Resolution techniques translates into quality improvements of the reconstructed 3D models, despite the artifacts that this may sometimes generate. We show that applying a Super-Resolution step before recovering the depth maps in most cases leads to a better 3D model both in the case of PatchMatch-based and deep-learning-based algorithms. The use of Super-Resolution especially improves the completeness of reconstructed models and turns out to be particularly effective in the case of textured scenes.

[27]  arXiv:2107.13263 [pdf, other]
Title: Learning-Based Depth and Pose Estimation for Monocular Endoscope with Loss Generalization
Comments: Accepted for EMBC 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Gastroendoscopy has been a clinical standard for diagnosing and treating conditions that affect a part of a patient's digestive system, such as the stomach. Although gastroendoscopy has many advantages for patients, there exist some challenges for practitioners, such as the lack of 3D perception, including the depth and the endoscope pose information. Such challenges make navigating the endoscope and localizing any found lesion in a digestive tract difficult. To tackle these problems, deep learning-based approaches have been proposed to provide monocular gastroendoscopy with additional yet important depth and pose information. In this paper, we propose a novel supervised approach to train depth and pose estimation networks using consecutive endoscopy images to assist the endoscope navigation in the stomach. We first generate real depth and pose training data using our previously proposed whole stomach 3D reconstruction pipeline to avoid poor generalization ability between computer-generated (CG) models and real data for the stomach. In addition, we propose a novel generalized photometric loss function to avoid the complicated process of finding proper weights for balancing the depth and the pose loss terms, which is required for existing direct depth and pose supervision approaches. We then experimentally show that our proposed generalized loss performs better than existing direct supervision losses.

[28]  arXiv:2107.13269 [pdf, other]
Title: Aug3D-RPN: Improving Monocular 3D Object Detection by Synthetic Images with Virtual Depth
Comments: 10 pages, 8 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Current geometry-based monocular 3D object detection models can efficiently detect objects by leveraging perspective geometry, but their performance is limited due to the absence of accurate depth information. Though this issue can be alleviated in a depth-based model where a depth estimation module is plugged in to predict depth information before 3D box reasoning, the introduction of such a module dramatically reduces the detection speed. Instead of training a costly depth estimator, we propose a rendering module to augment the training data by synthesizing images with virtual depths. The rendering module takes as input the RGB image and its corresponding sparse depth image, and outputs a variety of photo-realistic synthetic images, from which the detection model can learn more discriminative features to adapt to the depth changes of the objects. Besides, we introduce an auxiliary module to improve the detection model by jointly optimizing it through a depth estimation task. Both modules work only at training time, and no extra computation is introduced to the detection model. Experiments show that by working with our proposed modules, a geometry-based model can achieve leading accuracy on the KITTI 3D detection benchmark.

[29]  arXiv:2107.13271 [pdf, other]
Title: Spatial Uncertainty-Aware Semi-Supervised Crowd Counting
Comments: Accepted by ICCV2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Semi-supervised approaches for crowd counting attract attention, as the fully supervised paradigm is expensive and laborious due to its need for a large number of images of dense crowd scenarios and their annotations. This paper proposes a spatial uncertainty-aware semi-supervised approach via a regularized surrogate task (binary segmentation) for crowd counting problems. Different from existing semi-supervised learning-based crowd counting methods, to exploit the unlabeled data, our proposed spatial uncertainty-aware teacher-student framework focuses on information from high-confidence regions while addressing the noisy supervision from the unlabeled data in an end-to-end manner. Specifically, we estimate the spatial uncertainty maps from the teacher model's surrogate task to guide the feature learning of the main task (density regression) and the surrogate task of the student model at the same time. Besides, we introduce a simple yet effective differential transformation layer to enforce the inherent spatial consistency regularization between the main task and the surrogate task in the student model, which helps the surrogate task to yield more reliable predictions and generate high-quality uncertainty maps. Thus, our model can also address the task-level perturbation problems that cause spatial inconsistency between the primary and surrogate tasks in the student model. Experimental results on four challenging crowd counting datasets demonstrate that our method achieves superior performance to the state-of-the-art semi-supervised methods.

[30]  arXiv:2107.13273 [pdf, other]
Title: Rank-based verification for long-term face tracking in crowded scenes
Comments: arXiv admin note: substantial text overlap with arXiv:2010.08675
Journal-ref: IEEE Transactions on Biometrics, Behavior, and Identity Science, 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Most current multi-object trackers focus on short-term tracking, and are based on deep and complex systems that often cannot operate in real-time, making them impractical for video-surveillance. In this paper we present a long-term, multi-face tracking architecture conceived for working in crowded contexts where faces are often the only visible part of a person. Our system benefits from advances in the fields of face detection and face recognition to achieve long-term tracking, and is particularly unconstrained to the motion and occlusions of people. It follows a tracking-by-detection approach, combining a fast short-term visual tracker with a novel online tracklet reconnection strategy grounded on rank-based face verification. The proposed rank-based constraint favours higher inter-class distance among tracklets, and reduces the propagation of errors due to wrong reconnections. Additionally, a correction module is included to correct past assignments with no extra computational cost. We present a series of experiments introducing novel specialized metrics for the evaluation of long-term tracking capabilities, and publicly release a video dataset with 10 manually annotated videos and a total length of 8' 54". Our findings validate the robustness of each of the proposed modules, and demonstrate that, in these challenging contexts, our approach yields up to 50% longer tracks than state-of-the-art deep learning trackers.

[31]  arXiv:2107.13277 [pdf, other]
Title: A Novel CropdocNet for Automated Potato Late Blight Disease Detection from the Unmanned Aerial Vehicle-based Hyperspectral Imagery
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Late blight disease is one of the most destructive diseases in potato crops, leading to serious yield losses globally. Accurate diagnosis of the disease at an early stage is critical for precision disease control and management. Current farm practices in crop disease diagnosis are based on manual visual inspection, which is costly, time-consuming, and subject to individual bias. Recent advances in imaging sensors (e.g. RGB, multispectral and hyperspectral cameras), remote sensing and machine learning offer the opportunity to address this challenge. In particular, hyperspectral imagery (HSI) combined with machine learning/deep learning approaches is preferable for accurately identifying specific plant diseases because the HSI consists of a wide range of high-quality reflectance information beyond human vision, capable of capturing both spectral and spatial information. The proposed method considers the potential disease-specific reflectance radiation variance caused by the canopy structural diversity, and introduces multiple capsule layers to model the hierarchical structure of the spectral-spatial disease attributes, with the encapsulated features representing the various classes and the rotation invariance of the disease attributes in the feature space. We have evaluated the proposed method with real UAV-based HSI data under controlled field conditions. The effectiveness of the hierarchical features has been quantitatively assessed and compared with the existing representative machine learning/deep learning methods. The experimental results show that the proposed model significantly improves accuracy when considering the hierarchical structure of spectral-spatial features, compared to existing methods that use only spectral, only spatial, or spectral-spatial features without considering the hierarchical structure of spectral-spatial features.

[32]  arXiv:2107.13279 [pdf, other]
Title: Pseudo-LiDAR Based Road Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Road detection is a critically important task for self-driving cars. By employing LiDAR data, recent works have significantly improved the accuracy of road detection. However, relying on LiDAR sensors limits the wide application of those methods when only cameras are available. In this paper, we propose a novel road detection approach with RGB being the only input during inference. Specifically, we exploit pseudo-LiDAR using depth estimation, and propose a feature fusion network where RGB and learned depth information are fused for improved road detection. To further optimize the network structure and improve the efficiency of the network, we search for the network structure of the feature fusion module using NAS techniques. Finally, aware that generating pseudo-LiDAR from RGB via depth estimation introduces extra computational costs and relies on depth estimation networks, we design a modality distillation strategy and leverage it to further free our network from these extra computational costs and dependencies during inference. The proposed method achieves state-of-the-art performance on two challenging benchmarks, KITTI and R2D.

[33]  arXiv:2107.13335 [pdf, other]
Title: WaveCNet: Wavelet Integrated CNNs to Suppress Aliasing Effect for Noise-Robust Image Classification
Comments: IEEE TIP accepted paper. arXiv admin note: substantial text overlap with arXiv:2005.03337
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Though widely used in image classification, convolutional neural networks (CNNs) are prone to noise interruptions, i.e. the CNN output can be drastically changed by small image noise. To improve the noise robustness, we try to integrate CNNs with wavelets by replacing the common down-sampling (max-pooling, strided-convolution, and average pooling) with the discrete wavelet transform (DWT). We first propose general DWT and inverse DWT (IDWT) layers applicable to various orthogonal and biorthogonal discrete wavelets such as Haar, Daubechies, and Cohen, and then design wavelet integrated CNNs (WaveCNets) by integrating DWT into the commonly used CNNs (VGG, ResNets, and DenseNet). During the down-sampling, WaveCNets apply DWT to decompose the feature maps into the low-frequency and high-frequency components. Containing the main information including the basic object structures, the low-frequency component is transmitted into the following layers to generate robust high-level features. The high-frequency components are dropped to remove most of the data noise. The experimental results show that WaveCNets achieve higher accuracy on ImageNet than various vanilla CNNs. We have also tested the performance of WaveCNets on the noisy version of ImageNet, ImageNet-C and six adversarial attacks; the results suggest that the proposed DWT/IDWT layers could provide better noise-robustness and adversarial robustness. When applying WaveCNets as backbones, the performance of object detectors (i.e., faster R-CNN and RetinaNet) on the COCO detection dataset is consistently improved. We believe that suppression of the aliasing effect, i.e. separation of low frequency and high frequency information, is the main advantage of our approach. The code of our DWT/IDWT layer and different WaveCNets is available at https://github.com/CVI-SZU/WaveCNet.
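
The down-sampling idea described above (decompose with a DWT, keep the low-frequency component, drop the high-frequency ones) can be sketched for a single-level orthonormal 2D Haar transform. This is a plain-Python illustration under the assumption of even-sized inputs; the function names are hypothetical and this is not the authors' DWT/IDWT layer, which is available at the repository linked in the abstract.

```python
def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of an even-sized 2D list.

    Returns (low, details): the low-frequency subband and a tuple of the
    three high-frequency subbands, each half the size of `x` per axis.
    """
    rows, cols = len(x) // 2, len(x[0]) // 2
    low = [[0.0] * cols for _ in range(rows)]
    horiz = [[0.0] * cols for _ in range(rows)]
    vert = [[0.0] * cols for _ in range(rows)]
    diag = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            low[i][j] = (a + b + c + d) / 2    # scaled local average
            horiz[i][j] = (a - b + c - d) / 2  # left-right difference
            vert[i][j] = (a + b - c - d) / 2   # top-bottom difference
            diag[i][j] = (a - b - c + d) / 2   # diagonal difference
    return low, (horiz, vert, diag)


def wavelet_downsample(x):
    """Keep only the low-frequency subband, discarding the detail subbands."""
    low, _details = haar_dwt2(x)
    return low


feature_map = [[1.0, 1.0, 2.0, 4.0],
               [1.0, 1.0, 6.0, 8.0]]
low, (horiz, vert, diag) = haar_dwt2(feature_map)
```

For the Haar wavelet the low-frequency subband is a scaled 2x2 average, so in this special case the operation resembles average pooling up to a constant factor; other wavelets named in the abstract use longer filters.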

[34]  arXiv:2107.13355 [pdf, other]
Title: A Computer Vision-Based Approach for Driver Distraction Recognition using Deep Learning and Genetic Algorithm Based Ensemble
Comments: 12 pages, Presented in 20th International Conference on Artificial Intelligence and Soft Computing (ICAISC 2021)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

As the proportion of road accidents increases each year, driver distraction continues to be an important risk component in road traffic injuries and deaths. The distractions caused by the increasing use of mobile phones and other wireless devices pose a potential risk to road safety. Our current study aims to aid the already existing techniques in driver posture recognition by improving the performance in the driver distraction classification problem. We present an approach using a genetic algorithm-based ensemble of six independent deep neural architectures, namely, AlexNet, VGG-16, EfficientNet B0, Vanilla CNN, Modified DenseNet, and InceptionV3 + BiLSTM. We test it on two comprehensive datasets: the AUC Distracted Driver Dataset, on which our technique achieves an accuracy of 96.37%, surpassing the previously obtained 95.98%, and the State Farm Driver Distraction Dataset, on which we attain an accuracy of 99.75%. The 6-Model Ensemble gave an inference time of 0.024 seconds as measured on our machine with Ubuntu 20.04 (64-bit) and a GeForce GTX 1080 GPU.

[35]  arXiv:2107.13362 [pdf, other]
Title: Graph Constrained Data Representation Learning for Human Motion Segmentation
Comments: Accepted to ICCV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recently, transfer subspace learning based approaches have been shown to be a valid alternative to unsupervised subspace clustering and temporal data clustering for human motion segmentation (HMS). These approaches leverage prior knowledge from a source domain to improve clustering performance on a target domain, and currently they represent the state of the art in HMS. Bucking this trend, in this paper, we propose a novel unsupervised model that learns a representation of the data and digs clustering information from the data itself. Our model is reminiscent of temporal subspace clustering, but presents two critical differences. First, we learn an auxiliary data matrix that can deviate from the initial data, hence conferring more degrees of freedom to the coding matrix. Second, we introduce a regularization term for this auxiliary data matrix that preserves the local geometrical structure present in the high-dimensional space. The proposed model is efficiently optimized by using an original Alternating Direction Method of Multipliers (ADMM) formulation that allows jointly learning the auxiliary data representation, a nonnegative dictionary and a coding matrix. Experimental results on four benchmark datasets for HMS demonstrate that our approach achieves significantly better clustering performance than state-of-the-art methods, including both unsupervised and more recent semi-supervised transfer learning approaches.

[36]  arXiv:2107.13379 [pdf, other]
Title: Evaluating the Use of Reconstruction Error for Novelty Localization
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

The pixelwise reconstruction error of deep autoencoders is often utilized for image novelty detection and localization under the assumption that pixels with high error indicate which parts of the input image are unfamiliar and therefore likely to be novel. This assumed correlation between pixels with high reconstruction error and novel regions of input images has not been verified and may limit the accuracy of these methods. In this paper we utilize saliency maps to evaluate whether this correlation exists. Saliency maps reveal directly how much a change in each input pixel would affect reconstruction loss, while each pixel's reconstruction error may be attributed to many input pixels when layers are fully connected. We compare saliency maps to reconstruction error maps via qualitative visualizations as well as quantitative correspondence between the top K elements of the maps for both novel and normal images. Our results indicate that reconstruction error maps do not closely correlate with the importance of pixels in the input images, making them insufficient for novelty localization.
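
One simple way to quantify correspondence between the top K elements of two maps is the fraction of shared top-K pixel positions, sketched below. The abstract does not state the exact measure the authors use, so the overlap ratio and the function names here are assumptions for illustration.

```python
def top_k_indices(values, k):
    """Return the set of positions of the k largest entries in a flat list."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])


def top_k_overlap(map_a, map_b, k):
    """Fraction of top-k pixel positions shared by two equally sized 2D maps."""
    flat_a = [v for row in map_a for v in row]
    flat_b = [v for row in map_b for v in row]
    shared = top_k_indices(flat_a, k) & top_k_indices(flat_b, k)
    return len(shared) / k


error_map = [[0.9, 0.1], [0.8, 0.2]]
saliency_map = [[0.7, 0.6], [0.1, 0.2]]
overlap = top_k_overlap(error_map, saliency_map, k=2)
```

An overlap near 1 would mean the two maps highlight the same pixels; a low overlap on novel images would be consistent with the abstract's conclusion that the two kinds of map do not closely correlate.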

[37]  arXiv:2107.13389 [pdf, other]
Title: SimROD: A Simple Adaptation Method for Robust Object Detection
Comments: Accepted to ICCV 2021 conference for full oral presentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

This paper presents a Simple and effective unsupervised adaptation method for Robust Object Detection (SimROD). To overcome the challenging issues of domain shift and pseudo-label noise, our method integrates a novel domain-centric augmentation method, a gradual self-labeling adaptation procedure, and a teacher-guided fine-tuning mechanism. Using our method, target domain samples can be leveraged to adapt object detection models without changing the model architecture or generating synthetic data. When applied to image corruptions and high-level cross-domain adaptation benchmarks, our method outperforms prior baselines on multiple domain adaptation benchmarks. SimROD achieves new state-of-the-art on standard real-to-synthetic and cross-camera setup benchmarks. On the image corruption benchmark, models adapted with our method achieved a relative robustness improvement of 15-25% AP50 on Pascal-C and 5-6% AP on COCO-C and Cityscapes-C. On the cross-domain benchmark, our method outperformed the best baseline performance by up to 8% AP50 on Comic dataset and up to 4% on Watercolor dataset.

[38]  arXiv:2107.13411 [pdf, other]
Title: Predicting the Future from First Person (Egocentric) Vision: A Survey
Comments: Computer Vision and Image Understanding, 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Egocentric videos can bring a lot of information about how humans perceive the world and interact with the environment, which can be beneficial for the analysis of human behaviour. The research in egocentric video analysis is developing rapidly thanks to the increasing availability of wearable devices and the opportunities offered by new large-scale egocentric datasets. As computer vision techniques continue to develop at an increasing pace, the tasks related to the prediction of the future are starting to evolve from the need of understanding the present. Predicting future human activities, trajectories and interactions with objects is crucial in applications such as human-robot interaction, assistive wearable technologies for both industrial and daily living scenarios, entertainment and virtual or augmented reality. This survey summarises the evolution of studies in the context of future prediction from egocentric vision, providing an overview of applications, devices, existing problems, commonly used datasets, models and input modalities. Our analysis highlights that methods for future prediction from egocentric vision can have a significant impact in a range of applications and that further research efforts should be devoted to the standardisation of tasks and the proposal of datasets considering real-world scenarios such as the ones with an industrial vocation.

[39]  arXiv:2107.13421 [pdf, other]
Title: Neural Rays for Occlusion-aware Image-based Rendering
Comments: 16 pages and 16 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

We present a new neural representation, called Neural Ray (NeuRay), for the novel view synthesis (NVS) task with multi-view images as input. Existing neural scene representations for solving the NVS problem, such as NeRF, cannot generalize to new scenes and take an excessively long time to train on each new scene from scratch. The other subsequent neural rendering methods based on stereo matching, such as PixelNeRF, SRF and IBRNet, are designed to generalize to unseen scenes but suffer from view inconsistency in complex scenes with self-occlusions. To address these issues, our NeuRay method represents every scene by encoding the visibility of rays associated with the input views. This neural representation can efficiently be initialized from depths estimated by external MVS methods, which makes it able to generalize to new scenes and achieve satisfactory rendered images without any training on the scene. Then, the initialized NeuRay can be further optimized on every scene with little training time to enforce spatial coherence and ensure view consistency in the presence of severe self-occlusion. Experiments demonstrate that NeuRay can quickly generate high-quality novel view images of unseen scenes with little finetuning and can handle complex scenes with severe self-occlusions which previous methods struggle with.

[40]  arXiv:2107.13429 [pdf, other]
Title: Task-Specific Normalization for Continual Learning of Blind Image Quality Models
Comments: 12 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The computational vision community has recently paid attention to continual learning for blind image quality assessment (BIQA). The primary challenge is to combat catastrophic forgetting of previously-seen IQA datasets (i.e., tasks). In this paper, we present a simple yet effective continual learning method for BIQA with improved quality prediction accuracy, plasticity-stability trade-off, and task-order/length robustness. The key step in our approach is to freeze all convolution filters of a pre-trained deep neural network (DNN) for an explicit promise of stability, and learn task-specific normalization parameters for plasticity. We assign each new task a prediction head, and load the corresponding normalization parameters to produce a quality score. The final quality estimate is computed by feature fusion and adaptive weighting using hierarchical representations, without leveraging the test-time oracle. Extensive experiments on six IQA datasets demonstrate the advantages of the proposed method in comparison to previous training techniques for BIQA.

[41]  arXiv:2107.13452 [pdf, other]
Title: CarveNet: Carving Point-Block for Complex 3D Shape Completion
Comments: 10 pages and 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

3D point cloud completion is very challenging because it heavily relies on the accurate understanding of the complex 3D shapes (e.g., high-curvature, concave/convex, and hollowed-out 3D shapes) and the unknown and diverse patterns of the partially available point clouds. In this paper, we propose a novel solution, i.e., Point-block Carving (PC), for complex 3D point cloud completion. Given the partial point cloud as the guidance, we carve a 3D block that contains uniformly distributed 3D points, yielding the entire point cloud. To achieve PC, we propose a new network architecture, i.e., CarveNet. This network conducts the exclusive convolution on each point of the block, where the convolutional kernels are trained on the 3D shape data. CarveNet determines which point should be carved, for effectively recovering the details of the complete shapes. Furthermore, we propose a sensor-aware method for data augmentation, i.e., SensorAug, for training CarveNet on richer patterns of partial point clouds, thus enhancing the completion power of the network. The extensive evaluations on the ShapeNet and KITTI datasets demonstrate the generality of our approach on the partial point clouds with diverse patterns. On these datasets, CarveNet successfully outperforms the state-of-the-art methods.

[42]  arXiv:2107.13459 [pdf, other]
Title: Surrogate Model-Based Explainability Methods for Point Cloud NNs
Comments: 16 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

In the field of autonomous driving and robotics, point clouds are showing their excellent real-time performance as raw data from most of the mainstream 3D sensors. Therefore, point cloud neural networks have become a popular research direction in recent years. So far, however, there has been little discussion about the explainability of deep neural networks for point clouds. In this paper, we propose new explainability approaches for point cloud deep neural networks based on local surrogate model-based methods to show which components make the main contribution to the classification. Moreover, we propose a quantitative validation method for explainability methods of point clouds which enhances the persuasive power of explainability by dropping the most positive or negative contributing features and monitoring how the classification scores of specific categories change. To enable an intuitive explanation of misclassified instances, we display features with confounding contributions. Our new explainability approach provides a fairly accurate, more intuitive and widely applicable explanation for point cloud classification tasks. Our code is available at https://github.com/Explain3D/Explainable3D

[43]  arXiv:2107.13463 [pdf, other]
Title: Learning the shape of female breasts: an open-access 3D statistical shape model of the female breast built from 110 breast scans
Comments: 15 pages, 14 figures, for download of RBSM visit this https URL, submitted to Medical Image Analysis
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We present the Regensburg Breast Shape Model (RBSM) - a 3D statistical shape model of the female breast built from 110 breast scans, and the first ever publicly available. Together with the model, a fully automated, pairwise surface registration pipeline used to establish correspondence among 3D breast scans is introduced. Our method is computationally efficient and requires only four landmarks to guide the registration process. In order to weaken the strong coupling between breast and thorax, we propose to minimize the variance outside the breast region as much as possible. To achieve this goal, a novel concept called breast probability masks (BPMs) is introduced. A BPM assigns probabilities to each point of a 3D breast scan, telling how likely it is that a particular point belongs to the breast area. During registration, we use BPMs to align the template to the target as accurately as possible inside the breast region and only roughly outside. This simple yet effective strategy significantly reduces the unwanted variance outside the breast region, leading to better statistical shape models in which breast shapes are quite well decoupled from the thorax. The RBSM is thus able to produce a variety of different breast shapes as independently as possible from the shape of the thorax. Our systematic experimental evaluation reveals a generalization ability of 0.17 mm and a specificity of 2.8 mm for the RBSM. Ultimately, our model is seen as a first step towards combining physically motivated deformable models of the breast and statistical approaches in order to enable more realistic surgical outcome simulation.

[44]  arXiv:2107.13465 [pdf]
Title: A Proof-of-Concept Study of Artificial Intelligence Assisted Contour Revision
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Automatic segmentation of anatomical structures is critical for many medical applications. However, the results are not always clinically acceptable and require tedious manual revision. Here, we present a novel concept called artificial intelligence assisted contour revision (AIACR) and demonstrate its feasibility. The proposed clinical workflow of AIACR is as follows: given an initial contour that requires a clinician's revision, the clinician indicates where a large revision is needed, and a trained deep learning (DL) model takes this input to update the contour. This process repeats until a clinically acceptable contour is achieved. The DL model is designed to minimize the clinician's input at each iteration and to minimize the number of iterations needed to reach acceptance. In this proof-of-concept study, we demonstrated the concept on 2D axial images of three head-and-neck cancer datasets, with the clinician's input at each iteration being one mouse click on the desired location of the contour segment. The performance of the model is quantified with the Dice Similarity Coefficient (DSC) and the 95th percentile of the Hausdorff Distance (HD95). The average DSC/HD95 (mm) of the auto-generated initial contours were 0.82/4.3, 0.73/5.6 and 0.67/11.4 for the three datasets, which were improved to 0.91/2.1, 0.86/2.4 and 0.86/4.7 with three mouse clicks, respectively. Each DL-based contour update requires around 20 ms. We proposed a novel AIACR concept that uses DL models to assist clinicians in revising contours in an efficient and effective way, and we demonstrated its feasibility by using 2D axial CT images from three head-and-neck cancer datasets.
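
The two metrics named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names `dice` and `hd95` are hypothetical, contours are assumed to be given as binary masks (for DSC) and as arrays of boundary point coordinates in mm (for HD95), and HD95 is computed with one common convention (the larger of the two directed 95th-percentile distances); other conventions exist.

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice Similarity Coefficient between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        # Both masks empty: treat as perfect agreement (a convention).
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / denom

def hd95(points_a, points_b):
    """95th-percentile Hausdorff distance between two point sets.

    points_a, points_b: arrays of shape (n, d) and (m, d) holding
    contour point coordinates (assumed here to be in mm).
    """
    pa = np.asarray(points_a, dtype=float)
    pb = np.asarray(points_b, dtype=float)
    # Pairwise Euclidean distances, shape (n, m).
    dists = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    a_to_b = dists.min(axis=1)  # nearest point in B for each point in A
    b_to_a = dists.min(axis=0)  # nearest point in A for each point in B
    return max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95))
```

A DSC of 1 and an HD95 of 0 mm indicate identical contours; higher DSC and lower HD95 are better, which matches the direction of the improvements reported above.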

[45]  arXiv:2107.13467 [pdf, other]
Title: Recursively Conditional Gaussian for Ordinal Unsupervised Domain Adaptation
Comments: Accepted to ICCV 2021 (Oral)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

There has been a growing interest in unsupervised domain adaptation (UDA) to alleviate the data scalability issue, while the existing works usually focus on classifying independently discrete labels. However, in many tasks (e.g., medical diagnosis), the labels are discrete and successively distributed. The UDA for ordinal classification requires inducing a non-trivial ordinal distribution prior on the latent space. To this end, the partially ordered set (poset) is defined for constraining the latent vector. Instead of the typical i.i.d. Gaussian latent prior, in this work, a recursively conditional Gaussian (RCG) set is proposed for ordered constraint modeling, which admits a tractable joint distribution prior. Furthermore, we are able to control the density of content vectors that violate the poset constraint by a simple "three-sigma rule". We explicitly disentangle the cross-domain images into a shared ordinal content space induced by the ordinal prior and two separate source/target ordinal-unrelated spaces, and self-training is performed exclusively on the shared space for ordinal-aware domain alignment. Extensive experiments on UDA medical diagnoses and facial age estimation demonstrate its effectiveness.
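
For readers unfamiliar with the "three-sigma rule" mentioned above, the following sketch computes the probability mass of a one-dimensional Gaussian within and outside k standard deviations of its mean. It illustrates only the general rule; how the paper applies it to the poset constraint is not stated in the abstract, and the function names are hypothetical.

```python
import math

def gaussian_mass_within(k: float) -> float:
    """P(|X - mu| <= k * sigma) for a 1-D Gaussian X ~ N(mu, sigma^2)."""
    return math.erf(k / math.sqrt(2.0))

def gaussian_mass_outside(k: float) -> float:
    """P(|X - mu| > k * sigma), i.e. the two-sided tail mass."""
    return 1.0 - gaussian_mass_within(k)

if __name__ == "__main__":
    for k in (1.0, 2.0, 3.0):
        print(f"k={k:.0f}: within={gaussian_mass_within(k):.4f}, "
              f"outside={gaussian_mass_outside(k):.4f}")
```

With k = 3, roughly 99.73% of the mass lies within the interval, so about 0.27% falls outside it.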

[46]  arXiv:2107.13469 [pdf, other]
Title: Adversarial Unsupervised Domain Adaptation with Conditional and Label Shift: Infer, Align and Iterate
Comments: Accepted to ICCV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)

In this work, we propose an adversarial unsupervised domain adaptation (UDA) approach with the inherent conditional and label shifts, in which we aim to align the distributions w.r.t. both $p(x|y)$ and $p(y)$. Since the label is inaccessible in the target domain, the conventional adversarial UDA assumes $p(y)$ is invariant across domains, and relies on aligning $p(x)$ as an alternative to the $p(x|y)$ alignment. To address this, we provide a thorough theoretical and empirical analysis of the conventional adversarial UDA methods under both conditional and label shifts, and propose a novel and practical alternative optimization scheme for adversarial UDA. Specifically, we infer the marginal $p(y)$ and align $p(x|y)$ iteratively in the training, and precisely align the posterior $p(y|x)$ in testing. Our experimental results demonstrate its effectiveness on both classification and segmentation UDA, and partial UDA.
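
The reason aligning $p(x)$ is only a stand-in for aligning $p(x|y)$ can be written out with the law of total probability and Bayes' rule. The subscripts $s$ and $t$ for the source and target domains are notation assumed here for illustration and do not appear in the abstract.

```latex
% Marginal of x in each domain (law of total probability)
p_s(x) = \sum_{y} p_s(x \mid y)\, p_s(y), \qquad
p_t(x) = \sum_{y} p_t(x \mid y)\, p_t(y)

% If the conditionals already agree, p_s(x|y) = p_t(x|y) for all y,
% the marginals still differ whenever the label priors differ:
p_s(x) - p_t(x) = \sum_{y} p_t(x \mid y)\,\bigl(p_s(y) - p_t(y)\bigr)

% Target posterior used at test time (Bayes' rule)
p_t(y \mid x) = \frac{p_t(x \mid y)\, p_t(y)}{\sum_{y'} p_t(x \mid y')\, p_t(y')}
```

Under label shift, $p_s(y) \neq p_t(y)$, the difference in the second line is generally nonzero, so forcing $p_s(x) = p_t(x)$ can pull the conditionals out of alignment. This is consistent with the abstract's choice to infer $p(y)$ and align $p(x|y)$ separately.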

[47]  arXiv:2107.13484 [pdf, other]
Title: Inferring bias and uncertainty in camera calibration
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Accurate camera calibration is a precondition for many computer vision applications. Calibration errors, such as wrong model assumptions or imprecise parameter estimation, can deteriorate a system's overall performance, making the reliable detection and quantification of these errors critical. In this work, we introduce an evaluation scheme to capture the fundamental error sources in camera calibration: systematic errors (biases) and uncertainty (variance). The proposed bias detection method uncovers even the smallest systematic errors and thereby reveals imperfections of the calibration setup and provides the basis for camera model selection. A novel resampling-based uncertainty estimator enables uncertainty estimation under non-ideal conditions and thereby extends the classical covariance estimator. Furthermore, we derive a simple uncertainty metric that is independent of the camera model. In combination, the proposed methods can be used to assess the accuracy of individual calibrations, but also to benchmark new calibration algorithms, camera models, or calibration setups. We evaluate the proposed methods with simulations and real cameras.

[48]  arXiv:2107.13516 [pdf, other]
Title: CRD-CGAN: Category-Consistent and Relativistic Constraints for Diverse Text-to-Image Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Generating photo-realistic images from a text description is a challenging problem in computer vision. Previous works have shown promising performance in generating synthetic images conditioned on text with Generative Adversarial Networks (GANs). In this paper, we focus on the category-consistent and relativistic diverse constraints to optimize the diversity of synthetic images. Based on those constraints, a category-consistent and relativistic diverse conditional GAN (CRD-CGAN) is proposed to synthesize $K$ photo-realistic images simultaneously. We use the attention loss and diversity loss to improve the sensitivity of the GAN to word attention and noise. Then, we employ the relativistic conditional loss to estimate the probability of synthetic images being relatively real or fake, which can improve the performance of the basic conditional loss. Finally, we introduce a category-consistent loss to alleviate the over-category issues between the $K$ synthetic images. We evaluate our approach using the Birds-200-2011, Oxford-102 flower and MSCOCO 2014 datasets, and the extensive experiments demonstrate the superiority of the proposed method in comparison with state-of-the-art methods in terms of photorealism and diversity of the generated synthetic images.

Cross-lists for Thu, 29 Jul 21

[49]  arXiv:2107.13048 (cross-list from eess.IV) [pdf, other]
Title: Whole Slide Images are 2D Point Clouds: Context-Aware Survival Prediction using Patch-based Graph Convolutional Networks
Comments: MICCAI 2021
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)

Cancer prognostication is a challenging task in computational pathology that requires context-aware representations of histology features to adequately infer patient survival. Despite the advancements made in weakly-supervised deep learning, many approaches are not context-aware and are unable to model important morphological feature interactions between cell identities and tissue types that are prognostic for patient survival. In this work, we present Patch-GCN, a context-aware, spatially-resolved patch-based graph convolutional network that hierarchically aggregates instance-level histology features to model local- and global-level topological structures in the tumor microenvironment. We validate Patch-GCN with 4,370 gigapixel WSIs across five different cancer types from the Cancer Genome Atlas (TCGA), and demonstrate that Patch-GCN outperforms all prior weakly-supervised approaches by 3.58-9.46%. Our code and corresponding models are publicly available at https://github.com/mahmoodlab/Patch-GCN.

[50]  arXiv:2107.13054 (cross-list from cs.AI) [pdf, other]
Title: Exceeding the Limits of Visual-Linguistic Multi-Task Learning
Comments: 10 pages, 7 figures
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

By leveraging large amounts of product data collected across hundreds of live e-commerce websites, we construct 1000 unique classification tasks that share similarly-structured input data, comprised of both text and images. These classification tasks focus on learning the product hierarchy of different e-commerce websites, causing many of them to be correlated. Adopting a multi-modal transformer model, we solve these tasks in unison using multi-task learning (MTL). Extensive experiments are presented over an initial 100-task dataset to reveal best practices for "large-scale MTL" (i.e., MTL with more than 100 tasks). From these experiments, a final, unified methodology is derived, which is composed of both best practices and new proposals such as DyPa, a simple heuristic for automatically allocating task-specific parameters to tasks that could benefit from extra capacity. Using our large-scale MTL methodology, we successfully train a single model across all 1000 tasks in our dataset while using minimal task-specific parameters, thereby showing that it is possible to extend several orders of magnitude beyond current efforts in MTL.

[51]  arXiv:2107.13136 (cross-list from eess.IV) [pdf, other]
Title: Insights from Generative Modeling for Neural Video Compression
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. arXiv admin note: text overlap with arXiv:2010.10258
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

While recent machine learning research has revealed connections between deep generative models such as VAEs and rate-distortion losses used in learned compression, most of this work has focused on images. In a similar spirit, we view recently proposed neural video coding algorithms through the lens of deep autoregressive and latent variable modeling. We present recent neural video codecs as instances of a generalized stochastic temporal autoregressive transform, and propose new avenues for further improvements inspired by normalizing flows and structured priors. We propose several architectures that yield state-of-the-art video compression performance on full-resolution video and discuss their tradeoffs and ablations. In particular, we propose (i) improved temporal autoregressive transforms, (ii) improved entropy models with structured and temporal dependencies, and (iii) variable bitrate versions of our algorithms. Since our improvements are compatible with a large class of existing models, we provide further evidence that the generative modeling viewpoint can advance the neural video coding field.

[52]  arXiv:2107.13157 (cross-list from eess.IV) [pdf, other]
Title: Retinal Microvasculature as Biomarker for Diabetes and Cardiovascular Diseases
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Purpose: To demonstrate that retinal microvasculature per se is a reliable biomarker for Diabetic Retinopathy (DR) and, by extension, cardiovascular diseases. Methods: Deep Learning Convolutional Neural Networks (CNN) applied to color fundus images for semantic segmentation of the blood vessels and severity classification on both vascular and full images. Vessel reconstruction through harmonic descriptors is also used as a smoothing and de-noising tool. The mathematical background of the theory is also outlined. Results: For diabetic patients, at least 93.8% of DR No-Refer vs. Refer classification can be related to vasculature defects. As for the Non-Sight Threatening vs. Sight Threatening case, the ratio is as high as 96.7%. Conclusion: In the case of DR, most of the disease biomarkers are related topologically to the vasculature. Translational Relevance: Experiments conducted on eye blood vasculature reconstruction as a biomarker show a strong correlation between vasculature shape and later stages of DR.

[53]  arXiv:2107.13180 (cross-list from cs.MM) [pdf, other]
Title: Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)

The use of multiple and semantically correlated sources can provide complementary information to each other that may not be evident when working with individual modalities on their own. In this context, multi-modal models can help produce more accurate and robust predictions in machine learning tasks where audio-visual data is available. This paper presents a multi-modal model for automatic scene classification that simultaneously exploits auditory and visual information. The proposed approach makes use of two separate networks which are respectively trained in isolation on audio and visual data, so that each network specializes in a given modality. The visual subnetwork is a pre-trained VGG16 model followed by a bidirectional recurrent layer, while the residual audio subnetwork is based on stacked squeeze-excitation convolutional blocks trained from scratch. After training each subnetwork, the fusion of information from the audio and visual streams is performed at two different stages. The early fusion stage combines features resulting from the last convolutional block of the respective subnetworks at different time steps to feed a bidirectional recurrent structure. The late fusion stage combines the output of the early fusion stage with the independent predictions provided by the two subnetworks, resulting in the final prediction. We evaluate the method using the recently published TAU Audio-Visual Urban Scenes 2021, which contains synchronized audio and video recordings from 12 European cities in 10 different scene classes. The proposed model has been shown to provide an excellent trade-off between prediction performance (86.5%) and system complexity (15M parameters) in the evaluation results of the DCASE 2021 Challenge.

[54]  arXiv:2107.13200 (cross-list from eess.IV) [pdf]
Title: An explainable two-dimensional single model deep learning approach for Alzheimer's disease diagnosis and brain atrophy localization
Authors: Fan Zhang, Bo Pan, Pengfei Shao, Peng Liu (Alzheimer's Disease Neuroimaging Initiative, the Australian Imaging Biomarkers and Lifestyle flagship study of ageing), Shuwei Shen, Peng Yao, Ronald X. Xu
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Early and accurate diagnosis of Alzheimer's disease (AD) and its prodromal period, mild cognitive impairment (MCI), is essential for delaying disease progression and improving the quality of patients' lives. The emerging computer-aided diagnostic methods that combine deep learning with structural magnetic resonance imaging (sMRI) have achieved encouraging results, but some of them are limited by issues such as data leakage and unexplainable diagnosis. In this research, we propose a novel end-to-end deep learning approach for automated diagnosis of AD and localization of important brain regions related to the disease from sMRI data. This approach is based on a 2D single model strategy and has the following differences from the current approaches: 1) Convolutional Neural Network (CNN) models of different structures and capacities are evaluated systematically and the most suitable model is adopted for AD diagnosis; 2) a data augmentation strategy named Two-stage Random RandAugment (TRRA) is proposed to alleviate the overfitting issue caused by limited training data and to improve the classification performance in AD diagnosis; 3) an explainable method of Grad-CAM++ is introduced to generate the visually explainable heatmaps that localize and highlight the brain regions that our model focuses on and to make our model more transparent. Our approach has been evaluated on two publicly accessible datasets for two classification tasks of AD vs. cognitively normal (CN) and progressive MCI (pMCI) vs. stable MCI (sMCI). The experimental results indicate that our approach outperforms the state-of-the-art approaches, including those using multi-model and 3D CNN methods. The resultant localization heatmaps from our approach also highlight the lateral ventricle and some disease-relevant regions of cortex, coincident with the commonly affected regions during the development of AD.

[55]  arXiv:2107.13237 (cross-list from eess.AS) [pdf]
Title: A Visual Domain Transfer Learning Approach for Heartbeat Sound Classification
Subjects: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Heart disease is the most common cause of human mortality, accounting for almost one-third of deaths throughout the world. Detecting the disease early increases the patient's chances of survival, and there are several ways a sign of heart disease can be detected early. This research proposes to convert cleansed and normalized heart sounds into visual mel scale spectrograms and then to use visual domain transfer learning approaches to automatically extract features and categorize heart sounds. Some previous studies found that the spectrograms of various types of heart sounds are visually distinguishable to human eyes, which motivated this study to experiment with visual domain classification approaches for automated heart sound classification. It uses convolutional neural network-based architectures (e.g., ResNet, MobileNetV2) as the automated feature extractors from spectrograms. These well-accepted models in the image domain were shown to learn generalized feature representations of cardiac sounds collected from different environments with varying amplitude and noise levels. The model evaluation criteria used were categorical accuracy, precision, recall, and AUROC, as the chosen dataset is unbalanced. The proposed approach has been implemented on datasets A and B of the PASCAL heart sound collection and resulted in ~90% categorical accuracy and AUROC of ~0.97 for both sets.

[56]  arXiv:2107.13407 (cross-list from eess.IV) [pdf, other]
Title: High-speed object detection with a single-photon time-of-flight image sensor
Comments: 13 pages, 5 figures, 3 tables
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)

3D time-of-flight (ToF) imaging is used in a variety of applications such as augmented reality (AR), computer interfaces, robotics and autonomous systems. Single-photon avalanche diodes (SPADs) are one of the enabling technologies providing accurate depth data even over long ranges. By developing SPADs in array format with integrated processing combined with pulsed, flood-type illumination, high-speed 3D capture is possible. However, array sizes tend to be relatively small, limiting the lateral resolution of the resulting depth maps, and, consequently, the information that can be extracted from the image for applications such as object detection. In this paper, we demonstrate that these limitations can be overcome through the use of convolutional neural networks (CNNs) for high-performance object detection. We present outdoor results from a portable SPAD camera system that outputs 16-bin photon timing histograms with 64x32 spatial resolution. The results, obtained with exposure times down to 2 ms (equivalent to 500 FPS) and at signal-to-background ratios (SBR) as low as 0.05, point to the advantages of providing the CNN with full histogram data rather than point clouds alone. Alternatively, a combination of point cloud and active intensity data may be used as input, for a similar level of performance. In either case, the GPU-accelerated processing time is less than 1 ms per frame, leading to an overall latency (image acquisition plus processing) in the millisecond range, making the results relevant for safety-critical computer vision applications which would benefit from faster than human reaction times.

[57]  arXiv:2107.13431 (cross-list from eess.IV) [pdf]
Title: AI assisted method for efficiently generating breast ultrasound screening reports
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Ultrasound is the preferred choice for early screening of dense breast cancer. Clinically, doctors have to write the screening report manually, which is time-consuming, laborious, and prone to omissions and errors. Therefore, this paper proposes a method for efficiently generating personalized preliminary breast ultrasound screening reports by AI, especially for benign and normal cases, which account for the majority. Doctors then make simple adjustments or corrections to quickly generate final reports. The proposed approach has been tested using a database of 1133 breast tumor instances. Experimental results indicate this pipeline improves doctors' work efficiency by up to 90%, which greatly reduces repetitive work.

[58]  arXiv:2107.13542 (cross-list from eess.IV) [pdf, other]
Title: TEDS-Net: Enforcing Diffeomorphisms in Spatial Transformers to Guarantee Topology Preservation in Segmentations
Comments: International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2021
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Accurate topology is key when performing meaningful anatomical segmentations; however, it is often overlooked in traditional deep learning methods. In this work we propose TEDS-Net: a novel segmentation method that guarantees accurate topology. Our method is built upon a continuous diffeomorphic framework, which enforces topology preservation. However, in practice, diffeomorphic fields are represented using a finite number of parameters and sampled using methods such as linear interpolation, violating the theoretical guarantees. We therefore introduce additional modifications to more strictly enforce it. Our network learns how to warp a binary prior, with the desired topological characteristics, to complete the segmentation task. We tested our method on myocardium segmentation from an open-source 2D heart dataset. TEDS-Net preserved topology in 100% of the cases, compared to 90% from the U-Net, without sacrificing Hausdorff Distance or Dice performance. Code will be made available at: www.github.com/mwyburd/TEDS-Net

Replacements for Thu, 29 Jul 21

[59]  arXiv:2003.00826 (replaced) [pdf]
Title: Realistic River Image Synthesis using Deep Generative Adversarial Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[60]  arXiv:2003.01446 (replaced) [pdf, other]
Title: A New Dataset, Poisson GAN and AquaNet for Underwater Object Grabbing
Comments: 14 pages, 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[61]  arXiv:2004.03860 (replaced) [pdf, other]
Title: A Robust Method for Image Stitching
Journal-ref: Pattern Analysis and Applications, 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[62]  arXiv:2007.08032 (replaced) [pdf, other]
Title: When and how do CNNs generalize to out-of-distribution category-viewpoint combinations?
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[63]  arXiv:2008.03064 (replaced) [pdf, other]
Title: Evaluating Efficient Performance Estimators of Neural Architectures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[64]  arXiv:2009.08825 (replaced) [pdf, other]
Title: Densely Guided Knowledge Distillation using Multiple Teacher Assistants
Comments: Accepted to ICCV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[65]  arXiv:2011.10687 (replaced) [pdf, other]
Title: HDR Environment Map Estimation for Real-Time Augmented Reality
Comments: Supplementary video at this https URL. Code at this https URL. Accepted to CVPR 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
[66]  arXiv:2012.00257 (replaced) [pdf]
Title: Confluence: A Robust Non-IoU Alternative to Non-Maxima Suppression in Object Detection
Comments: 13 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[67]  arXiv:2012.14173 (replaced) [pdf, other]
Title: Playing to distraction: towards a robust training of CNN classifiers through visual explanation techniques
Comments: 20 pages, 3 figures, 4 tables
Journal-ref: Neural Comput & Applic (2021)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[68]  arXiv:2103.04019 (replaced) [pdf, other]
Title: Indoor Future Person Localization from an Egocentric Wearable Camera
Comments: accepted as conference paper in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[69]  arXiv:2103.15244 (replaced) [pdf, other]
Title: Rethinking ResNets: Improved Stacking Strategies With High Order Schemes
Comments: 11 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[70]  arXiv:2103.15685 (replaced) [pdf, other]
Title: Adaptive Boosting for Domain Adaptation: Towards Robust Predictions in Scene Segmentation
Comments: 10 pages, 7 tables, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[71]  arXiv:2103.16392 (replaced) [pdf, other]
Title: CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning
Comments: accepted by CVPR 2021, typos corrected, code link added
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[72]  arXiv:2104.10868 (replaced) [pdf, other]
Title: Towards Adversarial Patch Analysis and Certified Defense against Crowd Counting
Comments: Accepted by ACM Multimedia 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[73]  arXiv:2106.04066 (replaced) [pdf, other]
Title: Semantically Controllable Scene Generation with Guidance of Explicit Knowledge
Comments: 14 pages, 6 figures, Submitted to NeurIPS 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[74]  arXiv:2106.11154 (replaced) [pdf, other]
Title: Automatic Plant Cover Estimation with Convolutional Neural Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[75]  arXiv:2106.12226 (replaced) [pdf, other]
Title: Spatio-Temporal SAR-Optical Data Fusion for Cloud Removal via a Deep Hierarchical Model
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
[76]  arXiv:2106.15281 (replaced) [pdf, other]
Title: On Board Volcanic Eruption Detection through CNNs and Satellite Multispectral Imagery
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[77]  arXiv:2107.02192 (replaced) [pdf, other]
Title: Long-Short Transformer: Efficient Transformers for Language and Vision
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
[78]  arXiv:2107.06749 (replaced) [pdf, other]
Title: Dynamic Event Camera Calibration
Comments: accepted in the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[79]  arXiv:2107.07058 (replaced) [pdf, other]
Title: A Generalized Framework for Edge-preserving and Structure-preserving Image Smoothing
Comments: This work is accepted by TPAMI. The code is available at this https URL. arXiv admin note: substantial text overlap with arXiv:1907.09642
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
[80]  arXiv:2107.08369 (replaced) [pdf, other]
Title: Flood Segmentation on Sentinel-1 SAR Imagery with Semi-Supervised Learning
Comments: Equal authorship. This is a work in progress and is a submission to the Emerging Techniques in Computational Intelligence (ETCI) competition on Flood Detection. Code and models are available on GitHub
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[81]  arXiv:2107.11055 (replaced) [pdf, other]
Title: Transporting Causal Mechanisms for Unsupervised Domain Adaptation
Comments: ICCV 2021 Oral
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[82]  arXiv:2107.11646 (replaced) [pdf, other]
Title: Hand Image Understanding via Deep Multi-Task Learning
Comments: Accepted by ICCV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[83]  arXiv:2107.11795 (replaced) [pdf]
Title: Character Spotting Using Machine Learning Techniques
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[84]  arXiv:2107.12087 (replaced) [pdf, other]
Title: Text is Text, No Matter What: Unifying Text Recognition using Knowledge Distillation
Comments: IEEE International Conference on Computer Vision (ICCV), 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[85]  arXiv:2107.12093 (replaced) [pdf]
Title: A Multiple-Instance Learning Approach for the Assessment of Gallbladder Vascularity from Laparoscopic Images
Comments: 6 pages, 5 tables, 2 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[86]  arXiv:2107.12429 (replaced) [pdf, other]
Title: MonoIndoor: Towards Good Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments
Comments: ICCV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[87]  arXiv:2107.12664 (replaced) [pdf, other]
Title: Adaptive Boundary Proposal Network for Arbitrary Shape Text Detection
Comments: 10 pages, 8 figures, Accepted by ICCV2021
Journal-ref: ICCV2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[88]  arXiv:2107.12898 (replaced) [pdf, other]
Title: StarEnhancer: Learning Real-Time and Style-Aware Image Enhancement
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[89]  arXiv:1907.01845 (replaced) [pdf, other]
Title: FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search
Comments: Accepted to ICCV21
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[90]  arXiv:2012.03321 (replaced) [pdf, other]
Title: Global Unifying Intrinsic Calibration for Spinning and Solid-State LiDARs
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
[91]  arXiv:2012.15564 (replaced) [pdf, other]
Title: Exploiting Shared Knowledge from Non-COVID Lesions for Annotation-Efficient COVID-19 CT Lung Infection Segmentation
Comments: 12 pages
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[92]  arXiv:2103.01205 (replaced) [pdf, ps, other]
Title: Statistically Significant Stopping of Neural Network Training
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
[93]  arXiv:2104.03577 (replaced) [pdf, other]
Title: Stable deep neural network architectures for mitochondria segmentation on electron microscopy volumes
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[94]  arXiv:2106.03143 (replaced) [pdf, other]
Title: CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
[95]  arXiv:2107.09047 (replaced) [pdf, other]
Title: Know Thyself: Transferable Visuomotor Control Through Robot-Awareness
Comments: Website: this https URL. Typos corrected.
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[96]  arXiv:2107.09405 (replaced) [pdf, other]
Title: DeepSMILE: Self-supervised heterogeneity-aware multiple instance learning for DNA damage response defect classification directly from H&E whole-slide images
Comments: Main paper: 16 pages, 5 figures, 2 tables
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[97]  arXiv:2107.11400 (replaced) [pdf, other]
Title: Robust Explainability: A Tutorial on Gradient-Based Attribution Methods for Deep Neural Networks
Comments: 21 pages, 3 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[98]  arXiv:2107.11786 (replaced) [pdf, other]
Title: Deep Learning-based Frozen Section to FFPE Translation
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[99]  arXiv:2107.12469 (replaced) [pdf, ps, other]
Title: SaRNet: A Dataset for Deep Learning Assisted Search and Rescue with Satellite Imagery
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)