We gratefully acknowledge support from
the Simons Foundation and member institutions.

Computer Vision and Pattern Recognition

New submissions

[ total of 241 entries: 1-241 ]
[ showing up to 2000 entries per page: fewer | more ]

New submissions for Tue, 19 Oct 21

[1]  arXiv:2110.08314 [pdf]
Title: 3D Human Pose Estimation for Free-form Activity Using WiFi Signals
Authors: Yili Ren, Jie Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

WiFi human sensing has become increasingly attractive in enabling emerging human-computer interaction applications. The corresponding technique has gradually evolved from the classification of multiple activity types to more fine-grained tracking of 3D human poses. However, existing WiFi-based 3D human pose tracking is limited to a set of predefined activities. In this work, we present Winect, a 3D human pose tracking system for free-form activity using commodity WiFi devices. Our system tracks free-form activity by estimating a 3D skeleton pose that consists of a set of joints of the human body. In particular, we combine signal separation and joint movement modeling to achieve free-form activity tracking. Our system first identifies the moving limbs by leveraging the two-dimensional angle of arrival of the signals reflected off the human body and separates the entangled signals for each limb. Then, it tracks each limb and constructs a 3D skeleton of the body by modeling the inherent relationship between the movements of the limb and the corresponding joints. Our evaluation results show that Winect is environment-independent and achieves centimeter-level accuracy for free-form activity tracking under various challenging environments including the none-line-of-sight (NLoS) scenarios.

[2]  arXiv:2110.08327 [pdf, other]
Title: Solving Image PDEs with a Shallow Network
Comments: 21 pages, 22 figures, references arXiv:1802.06130, arXiv:1711.10700, arXiv:1606.01299
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Dynamical Systems (math.DS)

Partial differential equations (PDEs) are typically used as models of physical processes but are also of great interest in PDE-based image processing. However, when it comes to their use in imaging, conventional numerical methods for solving PDEs tend to require very fine grid resolution for stability, and as a result have impractically high computational cost. This work applies BLADE (Best Linear Adaptive Enhancement), a shallow learnable filtering framework, to PDE solving, and shows that the resulting approach is efficient and accurate, operating more reliably at coarse grid resolutions than classical methods. As such, the model can be flexibly used for a wide variety of problems in imaging.

[3]  arXiv:2110.08335 [pdf, other]
Title: Trigger Hunting with a Topological Prior for Trojan Detection
Comments: 16 pages, 9 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)

Despite their success and popularity, deep neural networks (DNNs) are vulnerable when facing backdoor attacks. This impedes their wider adoption, especially in mission critical applications. This paper tackles the problem of Trojan detection, namely, identifying Trojaned models -- models trained with poisoned data. One popular approach is reverse engineering, i.e., recovering the triggers on a clean image by manipulating the model's prediction. One major challenge of reverse engineering approach is the enormous search space of triggers. To this end, we propose innovative priors such as diversity and topological simplicity to not only increase the chances of finding the appropriate triggers but also improve the quality of the found triggers. Moreover, by encouraging a diverse set of trigger candidates, our method can perform effectively in cases with unknown target labels. We demonstrate that these priors can significantly improve the quality of the recovered triggers, resulting in substantially improved Trojan detection accuracy as validated on both synthetic and publicly available TrojAI benchmarks.

[4]  arXiv:2110.08365 [pdf, other]
Title: Counting Objects by Diffused Index: geometry-free and training-free approach
Authors: Mengyi Tang (1), Maryam Yashtini (2), Sung Ha Kang (1) ((1) Georgia Institute of Technology, (2) Georgetown University )
Subjects: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)

Counting objects is a fundamental but challenging problem. In this paper, we propose diffusion-based, geometry-free, and learning-free methodologies to count the number of objects in images. The main idea is to represent each object by a unique index value regardless of its intensity or size, and to simply count the number of index values. First, we place different vectors, refer to as seed vectors, uniformly throughout the mask image. The mask image has boundary information of the objects to be counted. Secondly, the seeds are diffused using an edge-weighted harmonic variational optimization model within each object. We propose an efficient algorithm based on an operator splitting approach and alternating direction minimization method, and theoretical analysis of this algorithm is given. An optimal solution of the model is obtained when the distributed seeds are completely diffused such that there is a unique intensity within each object, which we refer to as an index. For computational efficiency, we stop the diffusion process before a full convergence, and propose to cluster these diffused index values. We refer to this approach as Counting Objects by Diffused Index (CODI). We explore scalar and multi-dimensional seed vectors. For Scalar seeds, we use Gaussian fitting in histogram to count, while for vector seeds, we exploit a high-dimensional clustering method for the final step of counting via clustering. The proposed method is flexible even if the boundary of the object is not clear nor fully enclosed. We present counting results in various applications such as biological cells, agriculture, concert crowd, and transportation. Some comparisons with existing methods are presented.

[5]  arXiv:2110.08396 [pdf, other]
Title: Comparing Human and Machine Bias in Face Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

Much recent research has uncovered and discussed serious concerns of bias in facial analysis technologies, finding performance disparities between groups of people based on perceived gender, skin type, lighting condition, etc. These audits are immensely important and successful at measuring algorithmic bias but have two major challenges: the audits (1) use facial recognition datasets which lack quality metadata, like LFW and CelebA, and (2) do not compare their observed algorithmic bias to the biases of their human alternatives. In this paper, we release improvements to the LFW and CelebA datasets which will enable future researchers to obtain measurements of algorithmic bias that are not tainted by major flaws in the dataset (e.g. identical images appearing in both the gallery and test set). We also use these new data to develop a series of challenging facial identification and verification questions that we administered to various algorithms and a large, balanced sample of human reviewers. We find that both computer models and human survey participants perform significantly better at the verification task, generally obtain lower accuracy rates on dark-skinned or female subjects for both tasks, and obtain higher accuracy rates when their demographics match that of the question. Computer models are observed to achieve a higher level of accuracy than the survey participants on both tasks and exhibit bias to similar degrees as the human survey participants.

[6]  arXiv:2110.08398 [pdf, other]
Title: Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks
Comments: Video: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We present a new method for one shot domain adaptation. The input to our method is trained GAN that can produce images in domain A and a single reference image I_B from domain B. The proposed algorithm can translate any output of the trained GAN from domain A to domain B. There are two main advantages of our method compared to the current state of the art: First, our solution achieves higher visual quality, e.g. by noticeably reducing overfitting. Second, our solution allows for more degrees of freedom to control the domain gap, i.e. what aspects of image I_B are used to define the domain B. Technically, we realize the new method by building on a pre-trained StyleGAN generator as GAN and a pre-trained CLIP model for representing the domain gap. We propose several new regularizers for controlling the domain gap to optimize the weights of the pre-trained StyleGAN generator to output images in domain B instead of domain A. The regularizers prevent the optimization from taking on too many attributes of the single reference image. Our results show significant visual improvements over the state of the art as well as multiple applications that highlight improved control.

[7]  arXiv:2110.08421 [pdf, other]
Title: Dataset Knowledge Transfer for Class-Incremental Learning without Memory
Comments: Accepted to WACV 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Incremental learning enables artificial agents to learn from sequential data. While important progress was made by exploiting deep neural networks, incremental learning remains very challenging. This is particularly the case when no memory of past data is allowed and catastrophic forgetting has a strong negative effect. We tackle class-incremental learning without memory by adapting prediction bias correction, a method which makes predictions of past and new classes more comparable. It was proposed when a memory is allowed and cannot be directly used without memory, since samples of past classes are required. We introduce a two-step learning process which allows the transfer of bias correction parameters between reference and target datasets. Bias correction is first optimized offline on reference datasets which have an associated validation memory. The obtained correction parameters are then transferred to target datasets, for which no memory is available. The second contribution is to introduce a finer modeling of bias correction by learning its parameters per incremental state instead of the usual past vs. new class modeling. The proposed dataset knowledge transfer is applicable to any incremental method which works without memory. We test its effectiveness by applying it to four existing methods. Evaluation with four target datasets and different configurations shows consistent improvement, with practically no computational and memory overhead.

[8]  arXiv:2110.08429 [pdf, other]
Title: TorchEsegeta: Framework for Interpretability and Explainability of Image-based Deep Learning Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Clinicians are often very sceptical about applying automatic image processing approaches, especially deep learning based methods, in practice. One main reason for this is the black-box nature of these approaches and the inherent problem of missing insights of the automatically derived decisions. In order to increase trust in these methods, this paper presents approaches that help to interpret and explain the results of deep learning algorithms by depicting the anatomical areas which influence the decision of the algorithm most. Moreover, this research presents a unified framework, TorchEsegeta, for applying various interpretability and explainability techniques for deep learning models and generate visual interpretations and explanations for clinicians to corroborate their clinical findings. In addition, this will aid in gaining confidence in such methods. The framework builds on existing interpretability and explainability techniques that are currently focusing on classification models, extending them to segmentation tasks. In addition, these methods have been adapted to 3D models for volumetric analysis. The proposed framework provides methods to quantitatively compare visual explanations using infidelity and sensitivity metrics. This framework can be used by data scientists to perform post-hoc interpretations and explanations of their models, develop more explainable tools and present the findings to clinicians to increase their faith in such models. The proposed framework was evaluated based on a use case scenario of vessel segmentation models trained on Time-of-fight (TOF) Magnetic Resonance Angiogram (MRA) images of the human brain. Quantitative and qualitative results of a comparative study of different models and interpretability methods are presented. Furthermore, this paper provides an extensive overview of several existing interpretability and explainability methods.

[9]  arXiv:2110.08472 [pdf, other]
Title: Joint 3D Human Shape Recovery from A Single Imag with Bilayer-Graph
Comments: 3DV'21
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The ability to estimate the 3D human shape and pose from images can be useful in many contexts. Recent approaches have explored using graph convolutional networks and achieved promising results. The fact that the 3D shape is represented by a mesh, an undirected graph, makes graph convolutional networks a natural fit for this problem. However, graph convolutional networks have limited representation power. Information from nodes in the graph is passed to connected neighbors, and propagation of information requires successive graph convolutions. To overcome this limitation, we propose a dual-scale graph approach. We use a coarse graph, derived from a dense graph, to estimate the human's 3D pose, and the dense graph to estimate the 3D shape. Information in coarse graphs can be propagated over longer distances compared to dense graphs. In addition, information about pose can guide to recover local shape detail and vice versa. We recognize that the connection between coarse and dense is itself a graph, and introduce graph fusion blocks to exchange information between graphs with different scales. We train our model end-to-end and show that we can achieve state-of-the-art results for several evaluation datasets.

[10]  arXiv:2110.08484 [pdf, other]
Title: A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models
Comments: Preprint
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Large pretrained vision-language (VL) models can learn a new task with a handful of examples or generalize to a new task without fine-tuning. However, these gigantic VL models are hard to deploy for real-world applications due to their impractically huge model size and slow inference speed. In this work, we propose FewVLM, a few-shot prompt-based learner on vision-language tasks. We pretrain a sequence-to-sequence Transformer model with both prefix language modeling (PrefixLM) and masked language modeling (MaskedLM), and introduce simple prompts to improve zero-shot and few-shot performance on VQA and image captioning. Experimental results on five VQA and captioning datasets show that \method\xspace outperforms Frozen which is 31 times larger than ours by 18.2% point on zero-shot VQAv2 and achieves comparable results to a 246$\times$ larger model, PICa. We observe that (1) prompts significantly affect zero-shot performance but marginally affect few-shot performance, (2) MaskedLM helps few-shot VQA tasks while PrefixLM boosts captioning performance, and (3) performance significantly increases when training set size is small.

[11]  arXiv:2110.08493 [pdf]
Title: Grayscale Based Algorithm for Remote Sensing with Deep Learning
Comments: 14 pages, 4 figures, Journal Expert Systems with Applications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Remote sensing is the image acquisition of a target without having physical contact with it. Nowadays remote sensing data is widely preferred due to its reduced image acquisition period. The remote sensing of ground targets is more challenging because of the various factors that affect the propagation of light through different mediums from a satellite acquisition. Several Convolutional Neural Network-based algorithms are being implemented in the field of remote sensing. Supervised learning is a machine learning technique where the data is labelled according to their classes prior to the training. In order to detect and classify the targets more accurately, YOLOv3, an algorithm based on bounding and anchor boxes is adopted. In order to handle the various effects of light travelling through the atmosphere, Grayscale based YOLOv3 configuration is introduced. For better prediction and for solving the Rayleigh scattering effect, RGB based grayscale algorithms are proposed. The acquired images are analysed and trained with the grayscale based YOLO3 algorithm for target detection. The results show that the grayscale-based method can sense the target more accurately and effectively than the traditional YOLOv3 approach.

[12]  arXiv:2110.08495 [pdf, other]
Title: Hybrid Mutimodal Fusion for Dimensional Emotion Recognition
Comments: 8 pages, 2 figures, accepted by ACM MM2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we extensively present our solutions for the MuSe-Stress sub-challenge and the MuSe-Physio sub-challenge of Multimodal Sentiment Challenge (MuSe) 2021. The goal of MuSe-Stress sub-challenge is to predict the level of emotional arousal and valence in a time-continuous manner from audio-visual recordings and the goal of MuSe-Physio sub-challenge is to predict the level of psycho-physiological arousal from a) human annotations fused with b) galvanic skin response (also known as Electrodermal Activity (EDA)) signals from the stressed people. The Ulm-TSST dataset which is a novel subset of the audio-visual textual Ulm-Trier Social Stress dataset that features German speakers in a Trier Social Stress Test (TSST) induced stress situation is used in both sub-challenges. For the MuSe-Stress sub-challenge, we highlight our solutions in three aspects: 1) the audio-visual features and the bio-signal features are used for emotional state recognition. 2) the Long Short-Term Memory (LSTM) with the self-attention mechanism is utilized to capture complex temporal dependencies within the feature sequences. 3) the late fusion strategy is adopted to further boost the model's recognition performance by exploiting complementary information scattered across multimodal sequences. Our proposed model achieves CCC of 0.6159 and 0.4609 for valence and arousal respectively on the test set, which both rank in the top 3. For the MuSe-Physio sub-challenge, we first extract the audio-visual features and the bio-signal features from multiple modalities. Then, the LSTM module with the self-attention mechanism, and the Gated Convolutional Neural Networks (GCNN) as well as the LSTM network are utilized for modeling the complex temporal dependencies in the sequence. Finally, the late fusion strategy is used. Our proposed method also achieves CCC of 0.5412 on the test set, which ranks in the top 3.

[13]  arXiv:2110.08556 [pdf, other]
Title: Multi-View Stereo Network with attention thin volume
Authors: Zihang Wan
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We propose an efficient multi-view stereo (MVS) network for infering depth value from multiple RGB images. Recent studies have shown that mapping the geometric relationship in real space to neural network is an essential topic of the MVS problem. Specifically, these methods focus on how to express the correspondence between different views by constructing a nice cost volume. In this paper, we propose a more complete cost volume construction approach based on absorbing previous experience. First of all, we introduce the self-attention mechanism to fully aggregate the dominant information from input images and accurately model the long-range dependency, so as to selectively aggregate reference features. Secondly, we introduce the group-wise correlation to feature aggregation, which greatly reduces the memory and calculation burden. Meanwhile, this method enhances the information interaction between different feature channels. With this approach, a more lightweight and efficient cost volume is constructed. Finally we follow the coarse to fine strategy and refine the depth sampling range scale by scale with the help of uncertainty estimation. We further combine the previous steps to get the attention thin volume. Quantitative and qualitative experiments are presented to demonstrate the performance of our model.

[14]  arXiv:2110.08558 [pdf, other]
Title: Neural Network Pruning Through Constrained Reinforcement Learning
Comments: Submitted to ICASSP 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Network pruning reduces the size of neural networks by removing (pruning) neurons such that the performance drop is minimal. Traditional pruning approaches focus on designing metrics to quantify the usefulness of a neuron which is often quite tedious and sub-optimal. More recent approaches have instead focused on training auxiliary networks to automatically learn how useful each neuron is however, they often do not take computational limitations into account. In this work, we propose a general methodology for pruning neural networks. Our proposed methodology can prune neural networks to respect pre-defined computational budgets on arbitrary, possibly non-differentiable, functions. Furthermore, we only assume the ability to be able to evaluate these functions for different inputs, and hence they do not need to be fully specified beforehand. We achieve this by proposing a novel pruning strategy via constrained reinforcement learning algorithms. We prove the effectiveness of our approach via comparison with state-of-the-art methods on standard image classification datasets. Specifically, we reduce 83-92.90 of total parameters on various variants of VGG while achieving comparable or better performance than that of original networks. We also achieved 75.09 reduction in parameters on ResNet18 without incurring any loss in accuracy.

[15]  arXiv:2110.08562 [pdf, other]
Title: BNAS v2: Learning Architectures for Binary Networks with Empirical Improvements
Comments: arXiv admin note: text overlap with arXiv:2002.06963
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Backbone architectures of most binary networks are well-known floating point (FP) architectures such as the ResNet family. Questioning that the architectures designed for FP networks might not be the best for binary networks, we propose to search architectures for binary networks (BNAS) by defining a new search space for binary architectures and a novel search objective. Specifically, based on the cell based search method, we define the new search space of binary layer types, design a new cell template, and rediscover the utility of and propose to use the Zeroise layer instead of using it as a placeholder. The novel search objective diversifies early search to learn better performing binary architectures. We show that our method searches architectures with stable training curves despite the quantization error inherent in binary networks. Quantitative analyses demonstrate that our searched architectures outperform the architectures used in state-of-the-art binary networks and outperform or perform on par with state-of-the-art binary networks that employ various techniques other than architectural changes. In addition, we further propose improvements to the training scheme of our searched architectures. With the new training scheme for our searched architectures, we achieve the state-of-the-art performance by binary networks by outperforming all previous methods by non-trivial margins.

[16]  arXiv:2110.08568 [pdf, other]
Title: ASFormer: Transformer for Action Segmentation
Comments: Accepted by BMVC 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Algorithms for the action segmentation task typically use temporal models to predict what action is occurring at each frame for a minute-long daily activity. Recent studies have shown the potential of Transformer in modeling the relations among elements in sequential data. However, there are several major concerns when directly applying the Transformer to the action segmentation task, such as the lack of inductive biases with small training sets, the deficit in processing long input sequence, and the limitation of the decoder architecture to utilize temporal relations among multiple action segments to refine the initial predictions. To address these concerns, we design an efficient Transformer-based model for action segmentation task, named ASFormer, with three distinctive characteristics: (i) We explicitly bring in the local connectivity inductive priors because of the high locality of features. It constrains the hypothesis space within a reliable scope, and is beneficial for the action segmentation task to learn a proper target function with small training sets. (ii) We apply a pre-defined hierarchical representation pattern that efficiently handles long input sequences. (iii) We carefully design the decoder to refine the initial predictions from the encoder. Extensive experiments on three public datasets demonstrate that effectiveness of our methods. Code is available at \url{https://github.com/ChinaYi/ASFormer}.

[17]  arXiv:2110.08571 [pdf, other]
Title: Explore before Moving: A Feasible Path Estimation and Memory Recalling Framework for Embodied Navigation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

An embodied task such as embodied question answering (EmbodiedQA), requires an agent to explore the environment and collect clues to answer a given question that related with specific objects in the scene. The solution of such task usually includes two stages, a navigator and a visual Q&A module. In this paper, we focus on the navigation and solve the problem of existing navigation algorithms lacking experience and common sense, which essentially results in a failure finding target when robot is spawn in unknown environments.
Inspired by the human ability to think twice before moving and conceive several feasible paths to seek a goal in unfamiliar scenes, we present a route planning method named Path Estimation and Memory Recalling (PEMR) framework. PEMR includes a "looking ahead" process, \textit{i.e.} a visual feature extractor module that estimates feasible paths for gathering 3D navigational information, which is mimicking the human sense of direction. PEMR contains another process ``looking behind'' process that is a memory recall mechanism aims at fully leveraging past experience collected by the feature extractor. Last but not the least, to encourage the navigator to learn more accurate prior expert experience, we improve the original benchmark dataset and provide a family of evaluation metrics for diagnosing both navigation and question answering modules. We show strong experimental results of PEMR on the EmbodiedQA navigation task.

[18]  arXiv:2110.08578 [pdf, other]
Title: Visual-aware Attention Dual-stream Decoder for Video Captioning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Video captioning is a challenging task that captures different visual parts and describes them in sentences, for it requires visual and linguistic coherence. The attention mechanism in the current video captioning method learns to assign weight to each frame, promoting the decoder dynamically. This may not explicitly model the correlation and the temporal coherence of the visual features extracted in the sequence frames.To generate semantically coherent sentences, we propose a new Visual-aware Attention (VA) model, which concatenates dynamic changes of temporal sequence frames with the words at the previous moment, as the input of attention mechanism to extract sequence features.In addition, the prevalent approaches widely use the teacher-forcing (TF) learning during training, where the next token is generated conditioned on the previous ground-truth tokens. The semantic information in the previously generated tokens is lost. Therefore, we design a self-forcing (SF) stream that takes the semantic information in the probability distribution of the previous token as input to enhance the current token.The Dual-stream Decoder (DD) architecture unifies the TF and SF streams, generating sentences to promote the annotated captioning for both streams.Meanwhile, with the Dual-stream Decoder utilized, the exposure bias problem is alleviated, caused by the discrepancy between the training and testing in the TF learning.The effectiveness of the proposed Visual-aware Attention Dual-stream Decoder (VADD) is demonstrated through the result of experimental studies on Microsoft video description (MSVD) corpus and MSR-Video to text (MSR-VTT) datasets.

[19]  arXiv:2110.08580 [pdf, other]
Title: Intelligent Video Editing: Incorporating Modern Talking Face Generation Algorithms in a Video Editor
Comments: 9 pages, 7 figures, accepted in ICVGIP 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper proposes a video editor based on OpenShot with several state-of-the-art facial video editing algorithms as added functionalities. Our editor provides an easy-to-use interface to apply modern lip-syncing algorithms interactively. Apart from lip-syncing, the editor also uses audio and facial re-enactment to generate expressive talking faces. The manual control improves the overall experience of video editing without missing out on the benefits of modern synthetic video generation algorithms. This control enables us to lip-sync complex dubbed movie scenes, interviews, television shows, and other visual content. Furthermore, our editor provides features that automatically translate lectures from spoken content, lip-sync of the professor, and background content like slides. While doing so, we also tackle the critical aspect of synchronizing background content with the translated speech. We qualitatively evaluate the usefulness of the proposed editor by conducting human evaluations. Our evaluations show a clear improvement in the efficiency of using human editors and an improved video generation quality. We attach demo videos with the supplementary material clearly explaining the tool and also showcasing multiple results.

[20]  arXiv:2110.08589 [pdf, other]
Title: Pseudo-label refinement using superpixels for semi-supervised brain tumour segmentation
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Training neural networks using limited annotations is an important problem in the medical domain. Deep Neural Networks (DNNs) typically require large, annotated datasets to achieve acceptable performance which, in the medical domain, are especially difficult to obtain as they require significant time from expert radiologists. Semi-supervised learning aims to overcome this problem by learning segmentations with very little annotated data, whilst exploiting large amounts of unlabelled data. However, the best-known technique, which utilises inferred pseudo-labels, is vulnerable to inaccurate pseudo-labels degrading the performance. We propose a framework based on superpixels - meaningful clusters of adjacent pixels - to improve the accuracy of the pseudo labels and address this issue. Our framework combines superpixels with semi-supervised learning, refining the pseudo-labels during training using the features and edges of the superpixel maps. This method is evaluated on a multimodal magnetic resonance imaging (MRI) dataset for the task of brain tumour region segmentation. Our method demonstrates improved performance over the standard semi-supervised pseudo-labelling baseline when there is a reduced annotator burden and only 5 annotated patients are available. We report DSC=0.824 and DSC=0.707 for the test set whole tumour and tumour core regions respectively.

[21]  arXiv:2110.08590 [pdf, other]
Title: Automated Remote Sensing Forest Inventory Using Satelite Imagery
Comments: 15 pages, 11 figures, 71th International Astronautical Congress (IAC) - The CyberSpace Edition
Subjects: Computer Vision and Pattern Recognition (cs.CV)

For many countries like Russia, Canada, or the USA, a robust and detailed tree species inventory is essential to manage their forests sustainably. Since one can not apply unmanned aerial vehicle (UAV) imagery-based approaches to large-scale forest inventory applications, the utilization of machine learning algorithms on satellite imagery is a rising topic of research. Although satellite imagery quality is relatively low, additional spectral channels provide a sufficient amount of information for tree crown classification tasks. Assuming that tree crowns are detected already, we use embeddings of tree crowns generated by Autoencoders as a data set to train classical Machine Learning algorithms. We compare our Autoencoder (AE) based approach to traditional convolutional neural networks (CNN) end-to-end classifiers.

[22]  arXiv:2110.08636 [pdf, other]
Title: DPC: Unsupervised Deep Point Correspondence via Cross and Self Construction
Comments: 3DV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We present a new method for real-time non-rigid dense correspondence between point clouds based on structured shape construction. Our method, termed Deep Point Correspondence (DPC), requires a fraction of the training data compared to previous techniques and presents better generalization capabilities. Until now, two main approaches have been suggested for the dense correspondence problem. The first is a spectral-based approach that obtains great results on synthetic datasets but requires mesh connectivity of the shapes and long inference processing time while being unstable in real-world scenarios. The second is a spatial approach that uses an encoder-decoder framework to regress an ordered point cloud for the matching alignment from an irregular input. Unfortunately, the decoder brings considerable disadvantages, as it requires a large amount of training data and struggles to generalize well in cross-dataset evaluations. DPC's novelty lies in its lack of a decoder component. Instead, we use latent similarity and the input coordinates themselves to construct the point cloud and determine correspondence, replacing the coordinate regression done by the decoder. Extensive experiments show that our construction scheme leads to a performance boost in comparison to recent state-of-the-art correspondence methods. Our code is publicly available at https://github.com/dvirginz/DPC.

[23]  arXiv:2110.08667 [pdf, other]
Title: Face Verification with Challenging Imposters and Diversified Demographics
Authors: Adrian Popescu (1), Liviu-Daniel Ştefan (2), Jérôme Deshayes-Chossart (1), Bogdan Ionescu (2) ((1) Université Paris-Saclay, CEA, List, Palaiseau, France, (2) University Politehnica of Bucharest, Romania)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Face verification aims to distinguish between genuine and imposter pairs of faces, which include the same or different identities, respectively. The performance reported in recent years gives the impression that the task is practically solved. Here, we revisit the problem and argue that existing evaluation datasets were built using two oversimplifying design choices. First, the usual identity selection to form imposter pairs is not challenging enough because, in practice, verification is needed to detect challenging imposters. Second, the underlying demographics of existing datasets are often insufficient to account for the wide diversity of facial characteristics of people from across the world. To mitigate these limitations, we introduce the $FaVCI2D$ dataset. Imposter pairs are challenging because they include visually similar faces selected from a large pool of demographically diversified identities. The dataset also includes metadata related to gender, country and age to facilitate fine-grained analysis of results. $FaVCI2D$ is generated from freely distributable resources. Experiments with state-of-the-art deep models that provide nearly 100\% performance on existing datasets show a significant performance drop for $FaVCI2D$, confirming our starting hypothesis. Equally important, we analyze legal and ethical challenges which appeared in recent years and hindered the development of face analysis research. We introduce a series of design choices which address these challenges and make the dataset constitution and usage more sustainable and fairer. $FaVCI2D$ is available at~\url{https://github.com/AIMultimediaLab/FaVCI2D-Face-Verification-with-Challenging-Imposters-and-Diversified-Demographics}.

[24]  arXiv:2110.08679 [pdf]
Title: An Acceleration Method Based on Deep Learning and Multilinear Feature Space
Comments: 20 pages, International Journal of Artificial Intelligence and Applications
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Computer vision plays a crucial role in Advanced Assistance Systems. Most computer vision systems are based on Deep Convolutional Neural Networks (deep CNN) architectures. However, the high computational resource to run a CNN algorithm is demanding. Therefore, the methods to speed up computation have become a relevant research issue. Even though several works on architecture reduction found in the literature have not yet been achieved satisfactory results for embedded real-time system applications. This paper presents an alternative approach based on the Multilinear Feature Space (MFS) method resorting to transfer learning from large CNN architectures. The proposed method uses CNNs to generate feature maps, although it does not work as complexity reduction approach. After the training process, the generated features maps are used to create vector feature space. We use this new vector space to make projections of any new sample to classify them. Our method, named AMFC, uses the transfer learning from pre-trained CNN to reduce the classification time of new sample image, with minimal accuracy loss. Our method uses the VGG-16 model as the base CNN architecture for experiments; however, the method works with any similar CNN model. Using the well-known Vehicle Image Database and the German Traffic Sign Recognition Benchmark, we compared the classification time of the original VGG-16 model with the AMFC method, and our method is, on average, 17 times faster. The fast classification time reduces the computational and memory demands in embedded applications requiring a large CNN architecture.

[25]  arXiv:2110.08702 [pdf, other]
Title: SIN:Superpixel Interpolation Network
Comments: 15 pages, 8 figures, to be published in PRICAI-2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Superpixels have been widely used in computer vision tasks due to their representational and computational efficiency. Meanwhile, deep learning and end-to-end framework have made great progress in various fields including computer vision. However, existing superpixel algorithms cannot be integrated into subsequent tasks in an end-to-end way. Traditional algorithms and deep learning-based algorithms are two main streams in superpixel segmentation. The former is non-differentiable and the latter needs a non-differentiable post-processing step to enforce connectivity, which constraints the integration of superpixels and downstream tasks. In this paper, we propose a deep learning-based superpixel segmentation algorithm SIN which can be integrated with downstream tasks in an end-to-end way. Owing to some downstream tasks such as visual tracking require real-time speed, the speed of generating superpixels is also important. To remove the post-processing step, our algorithm enforces spatial connectivity from the start. Superpixels are initialized by sampled pixels and other pixels are assigned to superpixels through multiple updating steps. Each step consists of a horizontal and a vertical interpolation, which is the key to enforcing spatial connectivity. Multi-layer outputs of a fully convolutional network are utilized to predict association scores for interpolations. Experimental results show that our approach runs at about 80fps and performs favorably against state-of-the-art methods. Furthermore, we design a simple but effective loss function which reduces much training time. The improvements of superpixel-based tasks demonstrate the effectiveness of our algorithm. We hope SIN will be integrated into downstream tasks in an end-to-end way and benefit the superpixel-based community. Code is available at: \href{https://github.com/yuanqqq/SIN}{https://github.com/yuanqqq/SIN}.

[26]  arXiv:2110.08708 [pdf, other]
Title: Robust Pedestrian Attribute Recognition Using Group Sparsity for Occlusion Videos
Comments: 35 pages, 9 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Occlusion processing is a key issue in pedestrian attribute recognition (PAR). Nevertheless, several existing video-based PAR methods have not yet considered occlusion handling in depth. In this paper, we formulate finding non-occluded frames as sparsity-based temporal attention of a crowded video. In this manner, a model is guided not to pay attention to the occluded frame. However, temporal sparsity cannot include a correlation between attributes when occlusion occurs. For example, "boots" and "shoe color" cannot be recognized when the foot is invisible. To solve the uncorrelated attention issue, we also propose a novel group sparsity-based temporal attention module. Group sparsity is applied across attention weights in correlated attributes. Thus, attention weights in a group are forced to pay attention to the same frames. Experimental results showed that the proposed method achieved a higher F1-score than the state-of-the-art methods on two video-based PAR datasets and five occlusion scenarios.

[27]  arXiv:2110.08718 [pdf, other]
Title: AE-StyleGAN: Improved Training of Style-Based Auto-Encoders
Comments: Accepted at WACV-22
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

StyleGANs have shown impressive results on data generation and manipulation in recent years, thanks to its disentangled style latent space. A lot of efforts have been made in inverting a pretrained generator, where an encoder is trained ad hoc after the generator is trained in a two-stage fashion. In this paper, we focus on style-based generators asking a scientific question: Does forcing such a generator to reconstruct real data lead to more disentangled latent space and make the inversion process from image to latent space easy? We describe a new methodology to train a style-based autoencoder where the encoder and generator are optimized end-to-end. We show that our proposed model consistently outperforms baselines in terms of image inversion and generation quality. Supplementary, code, and pretrained models are available on the project website.

[28]  arXiv:2110.08729 [pdf, other]
Title: VoteHMR: Occlusion-Aware Voting Network for Robust 3D Human Mesh Recovery from Partial Point Clouds
Comments: Our paper are accepted to MM 2021 as oral
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

3D human mesh recovery from point clouds is essential for various tasks, including AR/VR and human behavior understanding. Previous works in this field either require high-quality 3D human scans or sequential point clouds, which cannot be easily applied to low-quality 3D scans captured by consumer-level depth sensors. In this paper, we make the first attempt to reconstruct reliable 3D human shapes from single-frame partial point clouds.To achieve this, we propose an end-to-end learnable method, named VoteHMR. The core of VoteHMR is a novel occlusion-aware voting network that can first reliably produce visible joint-level features from the input partial point clouds, and then complete the joint-level features through the kinematic tree of the human skeleton. Compared with holistic features used by previous works, the joint-level features can not only effectively encode the human geometry information but also be robust to noisy inputs with self-occlusions and missing areas. By exploiting the rich complementary clues from the joint-level features and global features from the input point clouds, the proposed method encourages reliable and disentangled parameter predictions for statistical 3D human models, such as SMPL. The proposed method achieves state-of-the-art performances on two large-scale datasets, namely SURREAL and DFAUST. Furthermore, VoteHMR also demonstrates superior generalization ability on real-world datasets, such as Berkeley MHAD.

[29]  arXiv:2110.08732 [pdf]
Title: A Deep Learning-based Approach for Real-time Facemask Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The COVID-19 pandemic is causing a global health crisis. Public spaces need to be safeguarded from the adverse effects of this pandemic. Wearing a facemask becomes one of the effective protection solutions adopted by many governments. Manual real-time monitoring of facemask wearing for a large group of people is becoming a difficult task. The goal of this paper is to use deep learning (DL), which has shown excellent results in many real-life applications, to ensure efficient real-time facemask detection. The proposed approach is based on two steps. An off-line step aiming to create a DL model that is able to detect and locate facemasks and whether they are appropriately worn. An online step that deploys the DL model at edge computing in order to detect masks in real-time. In this study, we propose to use MobileNetV2 to detect facemask in real-time. Several experiments are conducted and show good performances of the proposed approach (99% for training and testing accuracy). In addition, several comparisons with many state-of-the-art models namely ResNet50, DenseNet, and VGG16 show good performance of the MobileNetV2 in terms of training time and accuracy.

[30]  arXiv:2110.08733 [pdf, other]
Title: LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation
Comments: Accepted by NeurIPS 2021 Datasets and Benchmarks Track
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep learning approaches have shown promising results in remote sensing high spatial resolution (HSR) land-cover mapping. However, urban and rural scenes can show completely different geographical landscapes, and the inadequate generalizability of these algorithms hinders city-level or national-level mapping. Most of the existing HSR land-cover datasets mainly promote the research of learning semantic representation, thereby ignoring the model transferability. In this paper, we introduce the Land-cOVEr Domain Adaptive semantic segmentation (LoveDA) dataset to advance semantic and transferable learning. The LoveDA dataset contains 5927 HSR images with 166768 annotated objects from three different cities. Compared to the existing datasets, the LoveDA dataset encompasses two domains (urban and rural), which brings considerable challenges due to the: 1) multi-scale objects; 2) complex background samples; and 3) inconsistent class distributions. The LoveDA dataset is suitable for both land-cover semantic segmentation and unsupervised domain adaptation (UDA) tasks. Accordingly, we benchmarked the LoveDA dataset on eleven semantic segmentation methods and eight UDA methods. Some exploratory studies including multi-scale architectures and strategies, additional background supervision, and pseudo-label analysis were also carried out to address these challenges. The code and data are available at https://github.com/Junjue-Wang/LoveDA.

[31]  arXiv:2110.08754 [pdf, other]
Title: Fully-Connected Tensor Network Decomposition for Robust Tensor Completion Problem
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The robust tensor completion (RTC) problem, which aims to reconstruct a low-rank tensor from partially observed tensor contaminated by a sparse tensor, has received increasing attention. In this paper, by leveraging the superior expression of the fully-connected tensor network (FCTN) decomposition, we propose a $\textbf{FCTN}$-based $\textbf{r}$obust $\textbf{c}$onvex optimization model (RC-FCTN) for the RTC problem. Then, we rigorously establish the exact recovery guarantee for the RC-FCTN. For solving the constrained optimization model RC-FCTN, we develop an alternating direction method of multipliers (ADMM)-based algorithm, which enjoys the global convergence guarantee. Moreover, we suggest a $\textbf{FCTN}$-based $\textbf{r}$obust $\textbf{n}$on$\textbf{c}$onvex optimization model (RNC-FCTN) for the RTC problem. A proximal alternating minimization (PAM)-based algorithm is developed to solve the proposed RNC-FCTN. Meanwhile, we theoretically derive the convergence of the PAM-based algorithm. Comprehensive numerical experiments in several applications, such as video completion and video background subtraction, demonstrate that proposed methods are superior to several state-of-the-art methods.

[32]  arXiv:2110.08762 [pdf, other]
Title: Inconsistency-aware Uncertainty Estimation for Semi-supervised Medical Image Segmentation
Comments: Accepted by IEEE Transactions on Medical Imaging (TMI)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In semi-supervised medical image segmentation, most previous works draw on the common assumption that higher entropy means higher uncertainty. In this paper, we investigate a novel method of estimating uncertainty. We observe that, when assigned different misclassification costs in a certain degree, if the segmentation result of a pixel becomes inconsistent, this pixel shows a relative uncertainty in its segmentation. Therefore, we present a new semi-supervised segmentation model, namely, conservative-radical network (CoraNet in short) based on our uncertainty estimation and separate self-training strategy. In particular, our CoraNet model consists of three major components: a conservative-radical module (CRM), a certain region segmentation network (C-SN), and an uncertain region segmentation network (UC-SN) that could be alternatively trained in an end-to-end manner. We have extensively evaluated our method on various segmentation tasks with publicly available benchmark datasets, including CT pancreas, MR endocardium, and MR multi-structures segmentation on the ACDC dataset. Compared with the current state of the art, our CoraNet has demonstrated superior performance. In addition, we have also analyzed its connection with and difference from conventional methods of uncertainty estimation in semi-supervised medical image segmentation.

[33]  arXiv:2110.08774 [pdf, other]
Title: Nonlinear Transform Induced Tensor Nuclear Norm for Tensor Completion
Comments: Nonlinear transform, tensor nuclear norm, proximal alternating minimization, tensor completion
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

The linear transform-based tensor nuclear norm (TNN) methods have recently obtained promising results for tensor completion. The main idea of this type of methods is exploiting the low-rank structure of frontal slices of the targeted tensor under the linear transform along the third mode. However, the low-rankness of frontal slices is not significant under linear transforms family. To better pursue the low-rank approximation, we propose a nonlinear transform-based TNN (NTTNN). More concretely, the proposed nonlinear transform is a composite transform consisting of the linear semi-orthogonal transform along the third mode and the element-wise nonlinear transform on frontal slices of the tensor under the linear semi-orthogonal transform, which are indispensable and complementary in the composite transform to fully exploit the underlying low-rankness. Based on the suggested low-rankness metric, i.e., NTTNN, we propose a low-rank tensor completion (LRTC) model. To tackle the resulting nonlinear and nonconvex optimization model, we elaborately design the proximal alternating minimization (PAM) algorithm and establish the theoretical convergence guarantee of the PAM algorithm. Extensive experimental results on hyperspectral images, multispectral images, and videos show that the our method outperforms linear transform-based state-of-the-art LRTC methods qualitatively and quantitatively.

[34]  arXiv:2110.08787 [pdf, other]
Title: PixelPyramids: Exact Inference Models from Lossless Image Pyramids
Comments: To appear at ICCV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Autoregressive models are a class of exact inference approaches with highly flexible functional forms, yielding state-of-the-art density estimates for natural images. Yet, the sequential ordering on the dimensions makes these models computationally expensive and limits their applicability to low-resolution imagery. In this work, we propose Pixel-Pyramids, a block-autoregressive approach employing a lossless pyramid decomposition with scale-specific representations to encode the joint distribution of image pixels. Crucially, it affords a sparser dependency structure compared to fully autoregressive approaches. Our PixelPyramids yield state-of-the-art results for density estimation on various image datasets, especially for high-resolution data. For CelebA-HQ 1024 x 1024, we observe that the density estimates (in terms of bits/dim) are improved to ~44% of the baseline despite sampling speeds superior even to easily parallelizable flow-based models.

[35]  arXiv:2110.08791 [pdf, other]
Title: Taming Visually Guided Sound Generation
Comments: Accepted as an oral presentation for the BMVC 2021. Code: this https URL Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos in less time than it takes to play it on a single GPU.
We train a transformer to sample a new spectrogram from the pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN that significantly speeds up generation. Considering the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics called FID and MKL. These metrics are based on a novel sound classifier, called Melception, and designed to evaluate the fidelity and relevance of open-domain samples.
Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of generated samples. We also compare our model to the state-of-the-art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN

[36]  arXiv:2110.08797 [pdf, other]
Title: Towards Language-guided Visual Recognition via Dynamic Convolutions
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we are committed to establishing an unified and end-to-end multi-modal network via exploring the language-guided visual recognition. To approach this target, we first propose a novel multi-modal convolution module called Language-dependent Convolution (LaConv). Its convolution kernels are dynamically generated based on natural language information, which can help extract differentiated visual features for different multi-modal examples. Based on the LaConv module, we further build the first fully language-driven convolution network, termed as LaConvNet, which can unify the visual recognition and multi-modal reasoning in one forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on four benchmark datasets of two vision-and-language tasks, i.e., visual question answering (VQA) and referring expression comprehension (REC). The experimental results not only shows the performance gains of LaConv compared to the existing multi-modal modules, but also witness the merits of LaConvNet as an unified network, including compact network, high generalization ability and excellent performance, e.g., +4.7% on RefCOCO+.

[37]  arXiv:2110.08805 [pdf, other]
Title: Revealing Disocclusions in Temporal View Synthesis through Infilling Vector Prediction
Comments: WACV 2022. this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We consider the problem of temporal view synthesis, where the goal is to predict a future video frame from the past frames using knowledge of the depth and relative camera motion. In contrast to revealing the disoccluded regions through intensity based infilling, we study the idea of an infilling vector to infill by pointing to a non-disoccluded region in the synthesized view. To exploit the structure of disocclusions created by camera motion during their infilling, we rely on two important cues, temporal correlation of infilling directions and depth. We design a learning framework to predict the infilling vector by computing a temporal prior that reflects past infilling directions and a normalized depth map as input to the network. We conduct extensive experiments on a large scale dataset we build for evaluating temporal view synthesis in addition to the SceneNet RGB-D dataset. Our experiments demonstrate that our infilling vector prediction approach achieves superior quantitative and qualitative infilling performance compared to other approaches in literature.

[38]  arXiv:2110.08814 [pdf, other]
Title: TEAM-Net: Multi-modal Learning for Video Action Recognition with Partial Decoding
Comments: To appear in BMVC 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Most of existing video action recognition models ingest raw RGB frames. However, the raw video stream requires enormous storage and contains significant temporal redundancy. Video compression (e.g., H.264, MPEG-4) reduces superfluous information by representing the raw video stream using the concept of Group of Pictures (GOP). Each GOP is composed of the first I-frame (aka RGB image) followed by a number of P-frames, represented by motion vectors and residuals, which can be regarded and used as pre-extracted features. In this work, we 1) introduce sampling the input for the network from partially decoded videos based on the GOP-level, and 2) propose a plug-and-play mulTi-modal lEArning Module (TEAM) for training the network using information from I-frames and P-frames in an end-to-end manner. We demonstrate the superior performance of TEAM-Net compared to the baseline using RGB only. TEAM-Net also achieves the state-of-the-art performance in the area of video action recognition with partial decoding. Code is provided at https://github.com/villawang/TEAM-Net.

[39]  arXiv:2110.08818 [pdf, other]
Title: MeronymNet: A Hierarchical Approach for Unified and Controllable Multi-Category Object Generation
Comments: Accepted at ACM Multimedia (ACMMM) 2021 [ORAL] . Website : this https URL arXiv admin note: text overlap with arXiv:2006.00190
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)

We introduce MeronymNet, a novel hierarchical approach for controllable, part-based generation of multi-category objects using a single unified model. We adopt a guided coarse-to-fine strategy involving semantically conditioned generation of bounding box layouts, pixel-level part layouts and ultimately, the object depictions themselves. We use Graph Convolutional Networks, Deep Recurrent Networks along with custom-designed Conditional Variational Autoencoders to enable flexible, diverse and category-aware generation of 2-D objects in a controlled manner. The performance scores for generated objects reflect MeronymNet's superior performance compared to multiple strong baselines and ablative variants. We also showcase MeronymNet's suitability for controllable object generation and interactive object editing at various levels of structural and semantic granularity.

[40]  arXiv:2110.08822 [pdf, other]
Title: Siamese Transformer Pyramid Networks for Real-Time UAV Tracking
Comments: 10 pages, 8 figures, accepted by WACV2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Recent object tracking methods depend upon deep networks or convoluted architectures. Most of those trackers can hardly meet real-time processing requirements on mobile platforms with limited computing resources. In this work, we introduce the Siamese Transformer Pyramid Network (SiamTPN), which inherits the advantages from both CNN and Transformer architectures. Specifically, we exploit the inherent feature pyramid of a lightweight network (ShuffleNetV2) and reinforce it with a Transformer to construct a robust target-specific appearance model. A centralized architecture with lateral cross attention is developed for building augmented high-level feature maps. To avoid the computation and memory intensity while fusing pyramid representations with the Transformer, we further introduce the pooling attention module, which significantly reduces memory and time complexity while improving the robustness. Comprehensive experiments on both aerial and prevalent tracking benchmarks achieve competitive results while operating at high speed, demonstrating the effectiveness of SiamTPN. Moreover, our fastest variant tracker operates over 30 Hz on a single CPU-core and obtaining an AUC score of 58.1% on the LaSOT dataset. Source codes are available at https://github.com/RISCNYUAD/SiamTPNTracker

[41]  arXiv:2110.08825 [pdf, other]
Title: Localization with Sampling-Argmax
Comments: NeurIPS 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Soft-argmax operation is commonly adopted in detection-based methods to localize the target position in a differentiable manner. However, training the neural network with soft-argmax makes the shape of the probability map unconstrained. Consequently, the model lacks pixel-wise supervision through the map during training, leading to performance degradation. In this work, we propose sampling-argmax, a differentiable training method that imposes implicit constraints to the shape of the probability map by minimizing the expectation of the localization error. To approximate the expectation, we introduce a continuous formulation of the output distribution and develop a differentiable sampling process. The expectation can be approximated by calculating the average error of all samples drawn from the output distribution. We show that sampling-argmax can seamlessly replace the conventional soft-argmax operation on various localization tasks. Comprehensive experiments demonstrate the effectiveness and flexibility of the proposed method. Code is available at https://github.com/Jeff-sjtu/sampling-argmax

[42]  arXiv:2110.08828 [pdf]
Title: Compression-aware Projection with Greedy Dimension Reduction for Convolutional Neural Network Activations
Comments: 5 pages, 5 figures, submitted to 2022 ICASSP
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)

Convolutional neural networks (CNNs) achieve remarkable performance in a wide range of fields. However, intensive memory access of activations introduces considerable energy consumption, impeding deployment of CNNs on resourceconstrained edge devices. Existing works in activation compression propose to transform feature maps for higher compressibility, thus enabling dimension reduction. Nevertheless, in the case of aggressive dimension reduction, these methods lead to severe accuracy drop. To improve the trade-off between classification accuracy and compression ratio, we propose a compression-aware projection system, which employs a learnable projection to compensate for the reconstruction loss. In addition, a greedy selection metric is introduced to optimize the layer-wise compression ratio allocation by considering both accuracy and #bits reduction simultaneously. Our test results show that the proposed methods effectively reduce 2.91x~5.97x memory access with negligible accuracy drop on MobileNetV2/ResNet18/VGG16.

[43]  arXiv:2110.08842 [pdf]
Title: Exploring Novel Pooling Strategies for Edge Preserved Feature Maps in Convolutional Neural Networks
Comments: 29 pages, Submitted to Elsevier Pattern Recognition for review
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

With the introduction of anti-aliased convolutional neural networks (CNN), there has been some resurgence in relooking the way pooling is done in CNNs. The fundamental building block of the anti-aliased CNN has been the application of Gaussian smoothing before the pooling operation to reduce the distortion due to aliasing thereby making CNNs shift invariant. Wavelet based approaches have also been proposed as a possibility of additional noise removal capability and gave interesting results for even segmentation tasks. However, all the approaches proposed completely remove the high frequency components under the assumption that they are noise. However, by removing high frequency components, the edges in the feature maps are also smoothed. In this work, an exhaustive analysis of the edge preserving pooling options for classification, segmentation and autoencoders are presented. Two novel pooling approaches are presented such as Laplacian-Gaussian Concatenation with Attention (LGCA) pooling and Wavelet based approximate-detailed coefficient concatenation with attention (WADCA) pooling. The results suggest that the proposed pooling approaches outperform the conventional pooling as well as blur pooling for classification, segmentation and autoencoders.

[44]  arXiv:2110.08855 [pdf, other]
Title: Online Continual Learning Via Candidates Voting
Comments: Accepted paper at Winter Conference on Applications of Computer Vision (WACV 2022)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Continual learning in online scenario aims to learn a sequence of new tasks from data stream using each data only once for training, which is more realistic than in offline mode assuming data from new task are all available. However, this problem is still under-explored for the challenging class-incremental setting in which the model classifies all classes seen so far during inference. Particularly, performance struggles with increased number of tasks or additional classes to learn for each task. In addition, most existing methods require storing original data as exemplars for knowledge replay, which may not be feasible for certain applications with limited memory budget or privacy concerns. In this work, we introduce an effective and memory-efficient method for online continual learning under class-incremental setting through candidates selection from each learned task together with prior incorporation using stored feature embeddings instead of original data as exemplars. Our proposed method implemented for image classification task achieves the best results under different benchmark datasets for online continual learning including CIFAR-10, CIFAR-100 and CORE-50 while requiring much less memory resource compared with existing works.

[45]  arXiv:2110.08861 [pdf, other]
Title: 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers
Journal-ref: BMVC 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

3D reconstruction aims to reconstruct 3D objects from 2D views. Previous works for 3D reconstruction mainly focus on feature matching between views or using CNNs as backbones. Recently, Transformers have been shown effective in multiple applications of computer vision. However, whether or not Transformers can be used for 3D reconstruction is still unclear. In this paper, we fill this gap by proposing 3D-RETR, which is able to perform end-to-end 3D REconstruction with TRansformers. 3D-RETR first uses a pretrained Transformer to extract visual features from 2D input images. 3D-RETR then uses another Transformer Decoder to obtain the voxel features. A CNN Decoder then takes as input the voxel features to obtain the reconstructed objects. 3D-RETR is capable of 3D reconstruction from a single view or multiple views. Experimental results on two datasets show that 3DRETR reaches state-of-the-art performance on 3D reconstruction. Additional ablation study also demonstrates that 3D-DETR benefits from using Transformers.

[46]  arXiv:2110.08866 [pdf, other]
Title: Alleviating Noisy-label Effects in Image Classification via Probability Transition Matrix
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep-learning-based image classification frameworks often suffer from the noisy label problem caused by the inter-observer variation. Recent studies employed learning-to-learn paradigms (e.g., Co-teaching and JoCoR) to filter the samples with noisy labels from the training set. However, most of them use a simple cross-entropy loss as the criterion for noisy label identification. The hard samples, which are beneficial for classifier learning, are often mistakenly treated as noises in such a setting since both the hard samples and ones with noisy labels lead to a relatively larger loss value than the easy cases. In this paper, we propose a plugin module, namely noise ignoring block (NIB), consisting of a probability transition matrix and an inter-class correlation (IC) loss, to separate the hard samples from the mislabeled ones, and further boost the accuracy of image classification network trained with noisy labels. Concretely, our IC loss is calculated as Kullback-Leibler divergence between the network prediction and the accumulative soft label generated by the probability transition matrix. Such that, with the lower value of IC loss, the hard cases can be easily distinguished from mislabeled cases. Extensive experiments are conducted on natural and medical image datasets (CIFAR-10 and ISIC 2019). The experimental results show that our NIB module consistently improves the performances of the state-of-the-art robust training methods.

[47]  arXiv:2110.08872 [pdf, other]
Title: Contrastive Learning of Visual-Semantic Embeddings
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)

Contrastive learning is a powerful technique to learn representations that are semantically distinctive and geometrically invariant. While most of the earlier approaches have demonstrated its effectiveness on single-modality learning tasks such as image classification, recently there have been a few attempts towards extending this idea to multi-modal data. In this paper, we propose two loss functions based on normalized cross-entropy to perform the task of learning joint visual-semantic embedding using batch contrastive training. In a batch, for a given anchor point from one modality, we consider its negatives only from another modality, and define our first contrastive loss based on expected violations incurred by all the negatives. Next, we update this loss and define the second contrastive loss based on the violation incurred only by the hardest negative. We compare our results with existing visual-semantic embedding methods on cross-modal image-to-text and text-to-image retrieval tasks using the MS-COCO and Flickr30K datasets, where we outperform the state-of-the-art on the MS-COCO dataset and achieve comparable results on the Flickr30K dataset.

[48]  arXiv:2110.08890 [pdf, other]
Title: Network Augmentation for Tiny Deep Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

We introduce Network Augmentation (NetAug), a new training method for improving the performance of tiny neural networks. Existing regularization techniques (e.g., data augmentation, dropout) have shown much success on large neural networks (e.g., ResNet50) by adding noise to overcome over-fitting. However, we found these techniques hurt the performance of tiny neural networks. We argue that training tiny models are different from large models: rather than augmenting the data, we should augment the model, since tiny models tend to suffer from under-fitting rather than over-fitting due to limited capacity. To alleviate this issue, NetAug augments the network (reverse dropout) instead of inserting noise into the dataset or the network. It puts the tiny model into larger models and encourages it to work as a sub-model of larger models to get extra supervision, in addition to functioning as an independent model. At test time, only the tiny model is used for inference, incurring zero inference overhead. We demonstrate the effectiveness of NetAug on image classification and object detection. NetAug consistently improves the performance of tiny models, achieving up to 2.1% accuracy improvement on ImageNet, and 4.3% on Cars. On Pascal VOC, NetAug provides 2.96% mAP improvement with the same computational cost.

[49]  arXiv:2110.08893 [pdf, other]
Title: Temporally stable video segmentation without video annotations
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Temporally consistent dense video annotations are scarce and hard to collect. In contrast, image segmentation datasets (and pre-trained models) are ubiquitous, and easier to label for any novel task. In this paper, we introduce a method to adapt still image segmentation models to video in an unsupervised manner, by using an optical flow-based consistency measure. To ensure that the inferred segmented videos appear more stable in practice, we verify that the consistency measure is well correlated with human judgement via a user study. Training a new multi-input multi-output decoder using this measure as a loss, together with a technique for refining current image segmentation datasets and a temporal weighted-guided filter, we observe stability improvements in the generated segmented videos with minimal loss of accuracy.

[50]  arXiv:2110.08934 [pdf, other]
Title: On the Effect of Selfie Beautification Filters on Face Detection and Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Beautification and augmented reality filters are very popular in applications that use selfie images captured with smartphones or personal devices. However, they can distort or modify biometric features, severely affecting the capability of recognizing individuals' identity or even detecting the face. Accordingly, we address the effect of such filters on the accuracy of automated face detection and recognition. The social media image filters studied either modify the image contrast or illumination or occlude parts of the face with for example artificial glasses or animal noses. We observe that the effect of some of these filters is harmful both to face detection and identity recognition, specially if they obfuscate the eye or (to a lesser extent) the nose. To counteract such effect, we develop a method to reconstruct the applied manipulation with a modified version of the U-NET segmentation network. This is observed to contribute to a better face detection and recognition accuracy. From a recognition perspective, we employ distance measures and trained machine learning algorithms applied to features extracted using a ResNet-34 network trained to recognize faces. We also evaluate if incorporating filtered images to the training set of machine learning approaches are beneficial for identity recognition. Our results show good recognition when filters do not occlude important landmarks, specially the eyes (identification accuracy >99%, EER<2%). The combined effect of the proposed approaches also allow to mitigate the effect produced by filters that occlude parts of the face, achieving an identification accuracy of >92% with the majority of perturbations evaluated, and an EER <8%. Although there is room for improvement, when neither U-NET reconstruction nor training with filtered images is applied, the accuracy with filters that severely occlude the eye is <72% (identification) and >12% (EER)

[51]  arXiv:2110.08935 [pdf, other]
Title: InfAnFace: Bridging the infant-adult domain gap in facial landmark estimation in the wild
Subjects: Computer Vision and Pattern Recognition (cs.CV)

There is promising potential in the application of algorithmic facial landmark estimation to the early prediction, in infants, of pediatric developmental disorders and other conditions. However, the performance of these deep learning algorithms is severely hampered by the scarcity of infant data. To spur the development of facial landmarking systems for infants, we introduce InfAnFace, a diverse, richly-annotated dataset of infant faces. We use InfAnFace to benchmark the performance of existing facial landmark estimation algorithms that are trained on adult faces and demonstrate there is a significant domain gap between the representations learned by these algorithms when applied on infant vs. adult faces. Finally, we put forward the next potential steps to bridge that gap.

[52]  arXiv:2110.08940 [pdf, other]
Title: Dynamic Slimmable Denoising Network
Comments: 11 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Recently, tremendous human-designed and automatically searched neural networks have been applied to image denoising. However, previous works intend to handle all noisy images in a pre-defined static network architecture, which inevitably leads to high computational complexity for good denoising quality. Here, we present dynamic slimmable denoising network (DDS-Net), a general method to achieve good denoising quality with less computational complexity, via dynamically adjusting the channel configurations of networks at test time with respect to different noisy images. Our DDS-Net is empowered with the ability of dynamic inference by a dynamic gate, which can predictively adjust the channel configuration of networks with negligible extra computation cost. To ensure the performance of each candidate sub-network and the fairness of the dynamic gate, we propose a three-stage optimization scheme. In the first stage, we train a weight-shared slimmable super network. In the second stage, we evaluate the trained slimmable super network in an iterative way and progressively tailor the channel numbers of each layer with minimal denoising quality drop. By a single pass, we can obtain several sub-networks with good performance under different channel configurations. In the last stage, we identify easy and hard samples in an online way and train a dynamic gate to predictively select the corresponding sub-network with respect to different noisy images. Extensive experiments demonstrate our DDS-Net consistently outperforms the state-of-the-art individually trained static denoising networks.

[53]  arXiv:2110.08954 [pdf, other]
Title: Uncertainty-Aware Semi-Supervised Few Shot Segmentation
Comments: 9 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Few shot segmentation (FSS) aims to learn pixel-level classification of a target object in a query image using only a few annotated support samples. This is challenging as it requires modeling appearance variations of target objects and the diverse visual cues between query and support images with limited information. To address this problem, we propose a semi-supervised FSS strategy that leverages additional prototypes from unlabeled images with uncertainty guided pseudo label refinement. To obtain reliable prototypes from unlabeled images, we meta-train a neural network to jointly predict segmentation and estimate the uncertainty of predictions. We employ the uncertainty estimates to exclude predictions with high degrees of uncertainty for pseudo label construction to obtain additional prototypes based on the refined pseudo labels. During inference, query segmentation is predicted using prototypes from both support and unlabeled images including low-level features of the query images. Our approach is end-to-end and can easily supplement existing approaches without the requirement of additional training to employ unlabeled samples. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ demonstrate that our model can effectively remove unreliable predictions to refine pseudo labels and significantly improve upon state-of-the-art performances.

[54]  arXiv:2110.08955 [pdf]
Title: Predicting Rebar Endpoints using Sin Exponential Regression Model
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Currently, unmanned automation studies are underway to minimize the loss rate of rebar production and the time and accuracy of calibration when producing defective products in the cutting process of processing rebar factories. In this paper, we propose a method to detect and track rebar endpoint images entering the machine vision camera based on YOLO (You Only Look Once)v3, and to predict rebar endpoint in advance with sin exponential regression of acquired coordinates. The proposed method solves the problem of large prediction error rates for frame locations where rebar endpoints are far away in OPPDet (Object Position Prediction Detect) models, which prepredict rebar endpoints with improved results showing 0.23 to 0.52% less error rates at sin exponential regression prediction points.

[55]  arXiv:2110.08985 [pdf, other]
Title: StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis
Comments: 24 pages, 19 figures. Project page: this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

We propose StyleNeRF, a 3D-aware generative model for photo-realistic high-resolution image synthesis with high multi-view consistency, which can be trained on unstructured 2D images. Existing approaches either cannot synthesize high-resolution images with fine details or yield noticeable 3D-inconsistent artifacts. In addition, many of them lack control over style attributes and explicit 3D camera poses. StyleNeRF integrates the neural radiance field (NeRF) into a style-based generator to tackle the aforementioned challenges, i.e., improving rendering efficiency and 3D consistency for high-resolution image generation. We perform volume rendering only to produce a low-resolution feature map and progressively apply upsampling in 2D to address the first issue. To mitigate the inconsistencies caused by 2D upsampling, we propose multiple designs, including a better upsampler and a new regularization loss. With these designs, StyleNeRF can synthesize high-resolution images at interactive rates while preserving 3D consistency at high quality. StyleNeRF also enables control of camera poses and different levels of styles, which can generalize to unseen views. It also supports challenging tasks, including zoom-in and-out, style mixing, inversion, and semantic editing.

[56]  arXiv:2110.08988 [pdf, other]
Title: FEANet: Feature-Enhanced Attention Network for RGB-Thermal Real-time Semantic Segmentation
Comments: 7 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The RGB-Thermal (RGB-T) information for semantic segmentation has been extensively explored in recent years. However, most existing RGB-T semantic segmentation usually compromises spatial resolution to achieve real-time inference speed, which leads to poor performance. To better extract detail spatial information, we propose a two-stage Feature-Enhanced Attention Network (FEANet) for the RGB-T semantic segmentation task. Specifically, we introduce a Feature-Enhanced Attention Module (FEAM) to excavate and enhance multi-level features from both the channel and spatial views. Benefited from the proposed FEAM module, our FEANet can preserve the spatial information and shift more attention to high-resolution features from the fused RGB-T images. Extensive experiments on the urban scene dataset demonstrate that our FEANet outperforms other state-of-the-art (SOTA) RGB-T methods in terms of objective metrics and subjective visual comparison (+2.6% in global mAcc and +0.8% in global mIoU). For the 480 x 640 RGB-T test images, our FEANet can run with a real-time speed on an NVIDIA GeForce RTX 2080 Ti card.

[57]  arXiv:2110.08994 [pdf, other]
Title: CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification
Comments: 11 pages, 7 figures, 7 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Visible-infrared cross-modality person re-identification is a challenging ReID task, which aims to retrieve and match the same identity's images between the heterogeneous visible and infrared modalities. Thus, the core of this task is to bridge the huge gap between these two modalities. The existing convolutional neural network-based methods mainly face the problem of insufficient perception of modalities' information, and can not learn good discriminative modality-invariant embeddings for identities, which limits their performance. To solve these problems, we propose a cross-modality transformer-based method (CMTR) for the visible-infrared person re-identification task, which can explicitly mine the information of each modality and generate better discriminative features based on it. Specifically, to capture modalities' characteristics, we design the novel modality embeddings, which are fused with token embeddings to encode modalities' information. Furthermore, to enhance representation of modality embeddings and adjust matching embeddings' distribution, we propose a modality-aware enhancement loss based on the learned modalities' information, reducing intra-class distance and enlarging inter-class distance. To our knowledge, this is the first work of applying transformer network to the cross-modality re-identification task. We implement extensive experiments on the public SYSU-MM01 and RegDB datasets, and our proposed CMTR model's performance significantly surpasses existing outstanding CNN-based methods.

[58]  arXiv:2110.09004 [pdf, other]
Title: NYU-VPR: Long-Term Visual Place Recognition Benchmark with View Direction and Data Anonymization Influences
Comments: 7 pages, 10 figures, published in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2021)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Visual place recognition (VPR) is critical in not only localization and mapping for autonomous driving vehicles, but also assistive navigation for the visually impaired population. To enable a long-term VPR system on a large scale, several challenges need to be addressed. First, different applications could require different image view directions, such as front views for self-driving cars while side views for the low vision people. Second, VPR in metropolitan scenes can often cause privacy concerns due to the imaging of pedestrian and vehicle identity information, calling for the need for data anonymization before VPR queries and database construction. Both factors could lead to VPR performance variations that are not well understood yet. To study their influences, we present the NYU-VPR dataset that contains more than 200,000 images over a 2km by 2km area near the New York University campus, taken within the whole year of 2016. We present benchmark results on several popular VPR algorithms showing that side views are significantly more challenging for current VPR methods while the influence of data anonymization is almost negligible, together with our hypothetical explanations and in-depth analysis.

[59]  arXiv:2110.09006 [pdf, other]
Title: Natural Image Reconstruction from fMRI using Deep Learning: A Survey
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)

With the advent of brain imaging techniques and machine learning tools, much effort has been devoted to building computational models to capture the encoding of visual information in the human brain. One of the most challenging brain decoding tasks is the accurate reconstruction of the perceived natural images from brain activities measured by functional magnetic resonance imaging (fMRI). In this work, we survey the most recent deep learning methods for natural image reconstruction from fMRI. We examine these methods in terms of architectural design, benchmark datasets, and evaluation metrics and present a fair performance evaluation across standardized evaluation metrics. Finally, we discuss the strengths and limitations of existing studies and present potential future directions.

[60]  arXiv:2110.09028 [pdf]
Title: Fast tree skeleton extraction using voxel thinning based on tree point cloud
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Tree skeleton plays an important role in tree structure analysis, forest inventory and ecosystem monitoring. However, it is a challenge to extract a skeleton from a tree point cloud with complex branches. In this paper, an automatic and fast tree skeleton extraction method (FTSEM) based on voxel thinning is proposed. In this method, a wood-leaf classification algorithm was introduced to filter leaf points for the reduction of the leaf interference on tree skeleton generation, tree voxel thinning was adopted to extract raw tree skeleton quickly, and a breakpoint connection algorithm was used to improve the skeleton connectivity and completeness. Experiments were carried out in Haidian Park, Beijing, in which 24 trees were scanned and processed to obtain tree skeletons. The graph search algorithm (GSA) is used to extract tree skeletons based on the same datasets. Compared with GSA method, the FTSEM method obtained more complete tree skeletons. And the time cost of the FTSEM method is evaluated using the runtime and time per million points (TPMP). The runtime of FTSEM is from 1.0 s to 13.0 s, and the runtime of GSA is from 6.4 s to 309.3 s. The average value of TPMP is 1.8 s for FTSEM, and 22.3 s for GSA respectively. The experimental results demonstrate that the proposed method is feasible, robust, and fast with a good potential on tree skeleton extraction.

[61]  arXiv:2110.09047 [pdf, other]
Title: Abnormal Occupancy Grid Map Recognition using Attention Network
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The occupancy grid map is a critical component of autonomous positioning and navigation in the mobile robotic system, as many other systems' performance depends heavily on it. To guarantee the quality of the occupancy grid maps, researchers previously had to perform tedious manual recognition for a long time. This work focuses on automatic abnormal occupancy grid map recognition using the residual neural networks and a novel attention mechanism module. We propose an effective channel and spatial Residual SE(csRSE) attention module, which contains a residual block for producing hierarchical features, followed by both channel SE (cSE) block and spatial SE (sSE) block for the sufficient information extraction along the channel and spatial pathways. To further summarize the occupancy grid map characteristics and experiment with our csRSE attention modules, we constructed a dataset called occupancy grid map dataset (OGMD) for our experiments. On this OGMD test dataset, we tested few variants of our proposed structure and compared them with other attention mechanisms. Our experimental results show that the proposed attention network can infer the abnormal map with state-of-the-art (SOTA) accuracy of 96.23% for abnormal occupancy grid map recognition.

[62]  arXiv:2110.09060 [pdf, other]
Title: Discovery-and-Selection: Towards Optimal Multiple Instance Learning for Weakly Supervised Object Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Weakly supervised object detection (WSOD) is a challenging task that requires simultaneously learn object classifiers and estimate object locations under the supervision of image category labels. A major line of WSOD methods roots in multiple instance learning which regards images as bags of instance and selects positive instances from each bag to learn the detector. However, a grand challenge emerges when the detector inclines to converge to discriminative parts of objects rather than the whole objects. In this paper, under the hypothesis that optimal solutions are included in local minima, we propose a discoveryand-selection approach fused with multiple instance learning (DS-MIL), which finds rich local minima and select optimal solutions from multiple local minima. To implement DS-MIL, an attention module is designed so that more context information can be captured by feature maps and more valuable proposals can be collected during training. With proposal candidates, a re-rank module is designed to select informative instances for object detector training. Experimental results on commonly used benchmarks show that our proposed DS-MIL approach can consistently improve the baselines, reporting state-of-the-art performance.

[63]  arXiv:2110.09067 [pdf, other]
Title: Unsupervised Shot Boundary Detection for Temporal Segmentation of Long Capsule Endoscopy Videos
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Physicians use Capsule Endoscopy (CE) as a non-invasive and non-surgical procedure to examine the entire gastrointestinal (GI) tract for diseases and abnormalities. A single CE examination could last between 8 to 11 hours generating up to 80,000 frames which is compiled as a video. Physicians have to review and analyze the entire video to identify abnormalities or diseases before making diagnosis. This review task can be very tedious, time consuming and prone to error. While only as little as a single frame may capture useful content that is relevant to the physicians' final diagnosis, frames covering the small bowel region alone could be as much as 50,000. To minimize physicians' review time and effort, this paper proposes a novel unsupervised and computationally efficient temporal segmentation method to automatically partition long CE videos into a homogeneous and identifiable video segments. However, the search for temporal boundaries in a long video using high dimensional frame-feature matrix is computationally prohibitive and impracticable for real clinical application. Therefore, leveraging both spatial and temporal information in the video, we first extracted high level frame features using a pretrained CNN model and then projected the high-dimensional frame-feature matrix to lower 1-dimensional embedding. Using this 1-dimensional sequence embedding, we applied the Pruned Exact Linear Time (PELT) algorithm to searched for temporal boundaries that indicates the transition points from normal to abnormal frames and vice-versa. We experimented with multiple real patients' CE videos and our model achieved an AUC of 66\% on multiple test videos against expert provided labels.

[64]  arXiv:2110.09075 [pdf, other]
Title: Boosting the Transferability of Video Adversarial Examples via Temporal Translation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Although deep-learning based video recognition models have achieved remarkable success, they are vulnerable to adversarial examples that are generated by adding human-imperceptible perturbations on clean video samples. As indicated in recent studies, adversarial examples are transferable, which makes it feasible for black-box attacks in real-world applications. Nevertheless, most existing adversarial attack methods have poor transferability when attacking other video models and transfer-based attacks on video models are still unexplored. To this end, we propose to boost the transferability of video adversarial examples for black-box attacks on video recognition models. Through extensive analysis, we discover that different video recognition models rely on different discriminative temporal patterns, leading to the poor transferability of video adversarial examples. This motivates us to introduce a temporal translation attack method, which optimizes the adversarial perturbations over a set of temporal translated video clips. By generating adversarial examples over translated videos, the resulting adversarial examples are less sensitive to temporal patterns existed in the white-box model being attacked and thus can be better transferred. Extensive experiments on the Kinetics-400 dataset and the UCF-101 dataset demonstrate that our method can significantly boost the transferability of video adversarial examples. For transfer-based attack against video recognition models, it achieves a 61.56% average attack success rate on the Kinetics-400 and 48.60% on the UCF-101.

[65]  arXiv:2110.09107 [pdf, other]
Title: Differentiable Rendering with Perturbed Optimizers
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Reasoning about 3D scenes from their 2D image projections is one of the core problems in computer vision. Solutions to this inverse and ill-posed problem typically involve a search for models that best explain observed image data. Notably, images depend both on the properties of observed scenes and on the process of image formation. Hence, if optimization techniques should be used to explain images, it is crucial to design differentiable functions for the projection of 3D scenes into images, also known as differentiable rendering. Previous approaches to differentiable rendering typically replace non-differentiable operations by smooth approximations, impacting the subsequent 3D estimation. In this paper, we take a more general approach and study differentiable renderers through the prism of randomized optimization and the related notion of perturbed optimizers. In particular, our work highlights the link between some well-known differentiable renderer formulations and randomly smoothed optimizers, and introduces differentiable perturbed renderers. We also propose a variance reduction mechanism to alleviate the computational burden inherent to perturbed optimizers and introduce an adaptive scheme to automatically adjust the smoothing parameters of the rendering process. We apply our method to 3D scene reconstruction and demonstrate its advantages on the tasks of 6D pose estimation and 3D mesh reconstruction. By providing informative gradients that can be used as a strong supervisory signal, we demonstrate the benefits of perturbed renderers to obtain more accurate solutions when compared to the state-of-the-art alternatives using smooth gradient approximations.

[66]  arXiv:2110.09108 [pdf, other]
Title: Asymmetric Modality Translation For Face Presentation Attack Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Face presentation attack detection (PAD) is an essential measure to protect face recognition systems from being spoofed by malicious users and has attracted great attention from both academia and industry. Although most of the existing methods can achieve desired performance to some extent, the generalization issue of face presentation attack detection under cross-domain settings (e.g., the setting of unseen attacks and varying illumination) remains to be solved. In this paper, we propose a novel framework based on asymmetric modality translation for face presentation attack detection in bi-modality scenarios. Under the framework, we establish connections between two modality images of genuine faces. Specifically, a novel modality fusion scheme is presented that the image of one modality is translated to the other one through an asymmetric modality translator, then fused with its corresponding paired image. The fusion result is fed as the input to a discriminator for inference. The training of the translator is supervised by an asymmetric modality translation loss. Besides, an illumination normalization module based on Pattern of Local Gravitational Force (PLGF) representation is used to reduce the impact of illumination variation. We conduct extensive experiments on three public datasets, which validate that our method is effective in detecting various types of attacks and achieves state-of-the-art performance under different evaluation protocols.

[67]  arXiv:2110.09109 [pdf, other]
Title: Patch-Based Deep Autoencoder for Point Cloud Geometry Compression
Authors: Kang You, Pan Gao
Comments: Accepted to ACM Multimedia Asia (MMAsia '21)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)

The ever-increasing 3D application makes the point cloud compression unprecedentedly important and needed. In this paper, we propose a patch-based compression process using deep learning, focusing on the lossy point cloud geometry compression. Unlike existing point cloud compression networks, which apply feature extraction and reconstruction on the entire point cloud, we divide the point cloud into patches and compress each patch independently. In the decoding process, we finally assemble the decompressed patches into a complete point cloud. In addition, we train our network by a patch-to-patch criterion, i.e., use the local reconstruction loss for optimization, to approximate the global reconstruction optimality. Our method outperforms the state-of-the-art in terms of rate-distortion performance, especially at low bitrates. Moreover, the compression process we proposed can guarantee to generate the same number of points as the input. The network model of this method can be easily applied to other point cloud reconstruction problems, such as upsampling.

[68]  arXiv:2110.09110 [pdf, other]
Title: Graph Convolution Neural Network For Weakly Supervised Abnormality Localization In Long Capsule Endoscopy Videos
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Temporal activity localization in long videos is an important problem. The cost of obtaining frame level label for long Wireless Capsule Endoscopy (WCE) videos is prohibitive. In this paper, we propose an end-to-end temporal abnormality localization for long WCE videos using only weak video level labels. Physicians use Capsule Endoscopy (CE) as a non-surgical and non-invasive method to examine the entire digestive tract in order to diagnose diseases or abnormalities. While CE has revolutionized traditional endoscopy procedures, a single CE examination could last up to 8 hours generating as much as 100,000 frames. Physicians must review the entire video, frame-by-frame, in order to identify the frames capturing relevant abnormality. This, sometimes could be as few as just a single frame. Given this very high level of redundancy, analyzing long CE videos can be very tedious, time consuming and also error prone. This paper presents a novel multi-step method for an end-to-end localization of target frames capturing abnormalities of interest in the long video using only weak video labels. First we developed an automatic temporal segmentation using change point detection technique to temporally segment the video into uniform, homogeneous and identifiable segments. Then we employed Graph Convolutional Neural Network (GCNN) to learn a representation of each video segment. Using weak video segment labels, we trained our GCNN model to recognize each video segment as abnormal if it contains at least a single abnormal frame. Finally, leveraging the parameters of the trained GCNN model, we replaced the final layer of the network with a temporal pool layer to localize the relevant abnormal frames within each abnormal video segment. Our method achieved an accuracy of 89.9\% on the graph classification task and a specificity of 97.5\% on the abnormal frames localization task.

[69]  arXiv:2110.09129 [pdf, other]
Title: Deep Models with Fusion Strategies for MVP Point Cloud Registration
Comments: Point cloud registration competition, ICCV21 workshop. Substantial text overlap with arXiv:2107.02583
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The main goal of point cloud registration in Multi-View Partial (MVP) Challenge 2021 is to estimate a rigid transformation to align a point cloud pair. The pairs in this competition have the characteristics of low overlap, non-uniform density, unrestricted rotations and ambiguity, which pose a huge challenge to the registration task. In this report, we introduce our solution to the registration task, which fuses two deep learning models: ROPNet and PREDATOR, with customized ensemble strategies. Finally, we achieved the second place in the registration track with 2.96546, 0.02632 and 0.07808 under the the metrics of Rot\_Error, Trans\_Error and MSE, respectively.

[70]  arXiv:2110.09144 [pdf, other]
Title: SynCoLFinGer: Synthetic Contactless Fingerprint Generator
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We present the first method for synthetic generation of contactless fingerprint images, referred to as SynCoLFinGer. To this end, the constituent components of contactless fingerprint images regarding capturing, subject characteristics, and environmental influences are modeled and applied to a synthetically generated ridge pattern using the SFinGe algorithm. The proposed method is able to generate different synthetic samples corresponding to a single finger and it can be parameterized to generate contactless fingerprint images of various quality levels. The resemblance of the synthetically generated contactless fingerprints to real fingerprints is confirmed by evaluating biometric sample quality using an adapted NFIQ 2.0 algorithm and biometric utility using a state-of-the-art contactless fingerprint recognition system.

[71]  arXiv:2110.09157 [pdf, other]
Title: Disentangled Representation with Dual-stage Feature Learning for Face Anti-spoofing
Comments: WACV 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

As face recognition is widely used in diverse security-critical applications, the study of face anti-spoofing (FAS) has attracted more and more attention. Several FAS methods have achieved promising performances if the attack types in the testing data are the same as training data, while the performance significantly degrades for unseen attack types. It is essential to learn more generalized and discriminative features to prevent overfitting to pre-defined spoof attack types. This paper proposes a novel dual-stage disentangled representation learning method that can efficiently untangle spoof-related features from irrelevant ones. Unlike previous FAS disentanglement works with one-stage architecture, we found that the dual-stage training design can improve the training stability and effectively encode the features to detect unseen attack types. Our experiments show that the proposed method provides superior accuracy than the state-of-the-art methods on several cross-type FAS benchmarks.

[72]  arXiv:2110.09168 [pdf, other]
Title: Domain Generalisation for Apparent Emotional Facial Expression Recognition across Age-Groups
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Apparent emotional facial expression recognition has attracted a lot of research attention recently. However, the majority of approaches ignore age differences and train a generic model for all ages. In this work, we study the effect of using different age-groups for training apparent emotional facial expression recognition models. To this end, we study Domain Generalisation in the context of apparent emotional facial expression recognition from facial imagery across different age groups. We first compare several domain generalisation algorithms on the basis of out-of-domain-generalisation, and observe that the Class-Conditional Domain-Adversarial Neural Networks (CDANN) algorithm has the best performance. We then study the effect of variety and number of age-groups used during training on generalisation to unseen age-groups and observe that an increase in the number of training age-groups tends to increase the apparent emotional facial expression recognition performance on unseen age-groups. We also show that exclusion of an age-group during training tends to affect more the performance of the neighbouring age groups.

[73]  arXiv:2110.09170 [pdf, other]
Title: Continuation of Famous Art with AI: A Conditional Adversarial Network Inpainting Approach
Authors: Jordan J. Bird
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Much of the state-of-the-art in image synthesis inspired by real artwork are either entirely generative by filtered random noise or inspired by the transfer of style. This work explores the application of image inpainting to continue famous artworks and produce generative art with a Conditional GAN. During the training stage of the process, the borders of images are cropped, leaving only the centre. An inpainting GAN is then tasked with learning to reconstruct the original image from the centre crop by way of minimising both adversarial and absolute difference losses. Once the network is trained, images are then resized rather than cropped and presented as input to the generator. Following the learning process, the generator then creates new images by continuing from the edges of the original piece. Three experiments are performed with datasets of 4766 landscape paintings (impressionism and romanticism), 1167 Ukiyo-e works from the Japanese Edo period, and 4968 abstract artworks. Results show that geometry and texture (including canvas and paint) as well as scenery such as sky, clouds, water, land (including hills and mountains), grass, and flowers are implemented by the generator when extending real artworks. In the Ukiyo-e experiments, it was observed that features such as written text were generated even in cases where the original image did not have any, due to the presence of an unpainted border within the input image.

[74]  arXiv:2110.09195 [pdf, other]
Title: Sub-bit Neural Networks: Learning to Compress and Accelerate Binary Neural Networks
Comments: ICCV 2021. Code and models: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In the low-bit quantization field, training Binary Neural Networks (BNNs) is the extreme solution to ease the deployment of deep models on resource-constrained devices, having the lowest storage cost and significantly cheaper bit-wise operations compared to 32-bit floating-point counterparts. In this paper, we introduce Sub-bit Neural Networks (SNNs), a new type of binary quantization design tailored to compress and accelerate BNNs. SNNs are inspired by an empirical observation, showing that binary kernels learnt at convolutional layers of a BNN model are likely to be distributed over kernel subsets. As a result, unlike existing methods that binarize weights one by one, SNNs are trained with a kernel-aware optimization framework, which exploits binary quantization in the fine-grained convolutional kernel space. Specifically, our method includes a random sampling step generating layer-specific subsets of the kernel space, and a refinement step learning to adjust these subsets of binary kernels via optimization. Experiments on visual recognition benchmarks and the hardware deployment on FPGA validate the great potentials of SNNs. For instance, on ImageNet, SNNs of ResNet-18/ResNet-34 with 0.56-bit weights achieve 3.13/3.33 times runtime speed-up and 1.8 times compression over conventional BNNs with moderate drops in recognition accuracy. Promising results are also obtained when applying SNNs to binarize both weights and activations. Our code is available at https://github.com/yikaiw/SNN.

[75]  arXiv:2110.09202 [pdf, other]
Title: Finding Strong Gravitational Lenses Through Self-Attention
Comments: 11 Pages, 4 tables and 10 Figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM)

The upcoming large scale surveys are expected to find approximately $10^5$ strong gravitational systems by analyzing data of many orders of magnitude than the current era. In this scenario, non-automated techniques will be highly challenging and time-consuming. We propose a new automated architecture based on the principle of self-attention to find strong gravitational lensing. The advantages of self-attention based encoder models over convolution neural networks are investigated and encoder models are analyzed to optimize performance. We constructed 21 self-attention based encoder models and four convolution neural networks trained to identify gravitational lenses from the Bologna Lens Challenge. Each model is trained separately using 18,000 simulated images, cross-validated using 2 000 images, and then applied to a test set with 100 000 images. We used four different metrics for evaluation: classification accuracy, the area under the receiver operating characteristic curve (AUROC), the $TPR_0$ score and the $TPR_{10}$ score. The performance of the self-attention based encoder models and CNN's participated in the challenge are compared. The encoder models performed better than the CNNs and surpassed the CNN models that participated in the bologna lens challenge by a high margin for the $TPR_0$ and $TPR_{10}$. In terms of the AUROC, the encoder models scored equivalent to the top CNN model by only using one-sixth parameters to that of the CNN. Self-Attention based models have a clear advantage compared to simpler CNNs. A low computational cost and complexity make it a highly competing architecture to currently used residual neural networks. Moreover, introducing the encoder layers can also tackle the over-fitting problem present in the CNN's by acting as effective filters.

[76]  arXiv:2110.09217 [pdf]
Title: Color Image Segmentation Using Multi-Objective Swarm Optimizer and Multi-level Histogram Thresholding
Comments: 11 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

Rapid developments in swarm intelligence optimizers and computer processing abilities make opportunities to design more accurate, stable, and comprehensive methods for color image segmentation. This paper presents a new way for unsupervised image segmentation by combining histogram thresholding methods (Kapur's entropy and Otsu's method) and different multi-objective swarm intelligence algorithms (MOPSO, MOGWO, MSSA, and MOALO) to thresholding 3D histogram of a color image. More precisely, this method first combines the objective function of traditional thresholding algorithms to design comprehensive objective functions then uses multi-objective optimizers to find the best thresholds during the optimization of designed objective functions. Also, our method uses a vector objective function in 3D space that could simultaneously handle the segmentation of entire image color channels with the same thresholds. To optimize this vector objective function, we employ multiobjective swarm optimizers that can optimize multiple objective functions at the same time. Therefore, our method considers dependencies between channels to find the thresholds that satisfy objective functions of color channels (which we name as vector objective function) simultaneously. Segmenting entire color channels with the same thresholds also benefits from the fact that our proposed method needs fewer thresholds to segment the image than other thresholding algorithms; thus, it requires less memory space to save thresholds. It helps a lot when we want to segment many images to many regions. The subjective and objective results show the superiority of this method to traditional thresholding methods that separately threshold histograms of a color image.

[77]  arXiv:2110.09241 [pdf, other]
Title: Video Coding for Machine: Compact Visual Representation Compression for Intelligent Collaborative Analytics
Comments: The first three authors had equal contribution. arXiv admin note: text overlap with arXiv:2106.08512
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Video Coding for Machines (VCM) is committed to bridging to an extent separate research tracks of video/image compression and feature compression, and attempts to optimize compactness and efficiency jointly from a unified perspective of high accuracy machine vision and full fidelity human vision. In this paper, we summarize VCM methodology and philosophy based on existing academia and industrial efforts. The development of VCM follows a general rate-distortion optimization, and the categorization of key modules or techniques is established. From previous works, it is demonstrated that, although existing works attempt to reveal the nature of scalable representation in bits when dealing with machine and human vision tasks, there remains a rare study in the generality of low bit rate representation, and accordingly how to support a variety of visual analytic tasks. Therefore, we investigate a novel visual information compression for the analytics taxonomy problem to strengthen the capability of compact visual representations extracted from multiple tasks for visual analytics. A new perspective of task relationships versus compression is revisited. By keeping in mind the transferability among different machine vision tasks (e.g. high-level semantic and mid-level geometry-related), we aim to support multiple tasks jointly at low bit rates. In particular, to narrow the dimensionality gap between neural network generated features extracted from pixels and a variety of machine vision features/labels (e.g. scene class, segmentation labels), a codebook hyperprior is designed to compress the neural network-generated features. As demonstrated in our experiments, this new hyperprior model is expected to improve feature compression efficiency by estimating the signal entropy more accurately, which enables further investigation of the granularity of abstracting compact features among different tasks.

[78]  arXiv:2110.09243 [pdf, other]
Title: Leveraging MoCap Data for Human Mesh Recovery
Comments: 3DV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Training state-of-the-art models for human body pose and shape recovery from images or videos requires datasets with corresponding annotations that are really hard and expensive to obtain. Our goal in this paper is to study whether poses from 3D Motion Capture (MoCap) data can be used to improve image-based and video-based human mesh recovery methods. We find that fine-tune image-based models with synthetic renderings from MoCap data can increase their performance, by providing them with a wider variety of poses, textures and backgrounds. In fact, we show that simply fine-tuning the batch normalization layers of the model is enough to achieve large gains. We further study the use of MoCap data for video, and introduce PoseBERT, a transformer module that directly regresses the pose parameters and is trained via masked modeling. It is simple, generic and can be plugged on top of any state-of-the-art image-based model in order to transform it in a video-based model leveraging temporal information. Our experimental results show that the proposed approaches reach state-of-the-art performance on various datasets including 3DPW, MPI-INF-3DHP, MuPoTS-3D, MCB and AIST. Test code and models will be available soon.

[79]  arXiv:2110.09260 [pdf, other]
Title: A Unified Framework for Generalized Low-Shot Medical Image Segmentation with Scarce Data
Comments: Published in IEEE TRANSACTIONS ON MEDICAL IMAGING
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Medical image segmentation has achieved remarkable advancements using deep neural networks (DNNs). However, DNNs often need big amounts of data and annotations for training, both of which can be difficult and costly to obtain. In this work, we propose a unified framework for generalized low-shot (one- and few-shot) medical image segmentation based on distance metric learning (DML). Unlike most existing methods which only deal with the lack of annotations while assuming abundance of data, our framework works with extreme scarcity of both, which is ideal for rare diseases. Via DML, the framework learns a multimodal mixture representation for each category, and performs dense predictions based on cosine distances between the pixels' deep embeddings and the category representations. The multimodal representations effectively utilize the inter-subject similarities and intraclass variations to overcome overfitting due to extremely limited data. In addition, we propose adaptive mixing coefficients for the multimodal mixture distributions to adaptively emphasize the modes better suited to the current input. The representations are implicitly embedded as weights of the fc layer, such that the cosine distances can be computed efficiently via forward propagation. In our experiments on brain MRI and abdominal CT datasets, the proposed framework achieves superior performances for low-shot segmentation towards standard DNN-based (3D U-Net) and classical registration-based (ANTs) methods, e.g., achieving mean Dice coefficients of 81%/69% for brain tissue/abdominal multiorgan segmentation using a single training sample, as compared to 52%/31% and 72%/35% by the U-Net and ANTs, respectively.

[80]  arXiv:2110.09267 [pdf, other]
Title: Boosting Image Outpainting with Semantic Layout Prediction
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The objective of image outpainting is to extend image current border and generate new regions based on known ones. Previous methods adopt generative adversarial networks (GANs) to synthesize realistic images. However, the lack of explicit semantic representation leads to blurry and abnormal image pixels when the outpainting areas are complex and with various objects. In this work, we decompose the outpainting task into two stages. Firstly, we train a GAN to extend regions in semantic segmentation domain instead of image domain. Secondly, another GAN model is trained to synthesize real images based on the extended semantic layouts. The first model focuses on low frequent context such as sizes, classes and other semantic cues while the second model focuses on high frequent context like color and texture. By this design, our approach can handle semantic clues more easily and hence works better in complex scenarios. We evaluate our framework on various datasets and make quantitative and qualitative analysis. Experiments demonstrate that our method generates reasonable extended semantic layouts and images, outperforming state-of-the-art models.

[81]  arXiv:2110.09278 [pdf, other]
Title: A Lightweight and Accurate Recognition Framework for Signs of X-ray Weld Images
Subjects: Computer Vision and Pattern Recognition (cs.CV)

X-ray images are commonly used to ensure the security of devices in quality inspection industry. The recognition of signs printed on X-ray weld images plays an essential role in digital traceability system of manufacturing industry. However, the scales of objects vary different greatly in weld images, and it hinders us to achieve satisfactory recognition. In this paper, we propose a signs recognition framework based on convolutional neural networks (CNNs) for weld images. The proposed framework firstly contains a shallow classification network for correcting the pose of images. Moreover, we present a novel spatial and channel enhancement (SCE) module to address the above scale problem. This module can integrate multi-scale features and adaptively assign weights for each feature source. Based on SCE module, a narrow network is designed for final weld information recognition. To enhance the practicability of our framework, we carefully design the architecture of framework with a few parameters and computations. Experimental results show that our framework achieves 99.7% accuracy with 1.1 giga floating-point of operations (GFLOPs) on classification stage, and 90.0 mean average precision (mAP) with 176.1 frames per second (FPS) on recognition stage.

[82]  arXiv:2110.09286 [pdf, other]
Title: Gait-based Human Identification through Minimum Gait-phases and Sensors
Comments: Accepted in 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2021)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Human identification is one of the most common and critical tasks for condition monitoring, human-machine interaction, and providing assistive services in smart environments. Recently, human gait has gained new attention as a biometric for identification to achieve contactless identification from a distance robust to physical appearances. However, an important aspect of gait identification through wearables and image-based systems alike is accurate identification when limited information is available, for example, when only a fraction of the whole gait cycle or only a part of the subject body is visible. In this paper, we present a gait identification technique based on temporal and descriptive statistic parameters of different gait phases as the features and we investigate the performance of using only single gait phases for the identification task using a minimum number of sensors. It was shown that it is possible to achieve high accuracy of over 95.5 percent by monitoring a single phase of the whole gait cycle through only a single sensor. It was also shown that the proposed methodology could be used to achieve 100 percent identification accuracy when the whole gait cycle was monitored through pelvis and foot sensors combined. The ANN was found to be more robust to fewer data features compared to SVM and was concluded as the best machine algorithm for the purpose.

[83]  arXiv:2110.09298 [pdf, other]
Title: A DCT-based Tensor Completion Approach for Recovering Color Images and Videos from Highly Undersampled Data
Comments: 13 pages, 2 tables and 8 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)

Recovering color images and videos from highly undersampled data is a fundamental and challenging task in face recognition and computer vision. By the multi-dimensional nature of color images and videos, in this paper, we propose a novel tensor completion approach, which is able to efficiently explore the sparsity of tensor data under the discrete cosine transform (DCT). Specifically, we introduce two DCT-based tensor completion models as well as two implementable algorithms for their solutions. The first one is a DCT-based weighted nuclear norm minimization model. The second one is called DCT-based $p$-shrinking tensor completion model, which is a nonconvex model utilizing $p$-shrinkage mapping for promoting the low-rankness of data. Moreover, we accordingly propose two implementable augmented Lagrangian-based algorithms for solving the underlying optimization models. A series of numerical experiments including color and MRI image inpainting and video data recovery demonstrate that our proposed approach performs better than many existing state-of-the-art tensor completion methods, especially for the case when the ratio of missing data is high.

[84]  arXiv:2110.09299 [pdf, other]
Title: A Literature Review of 3D Face Reconstruction From a Single Image
Authors: Hanxin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper is a brief survey of the recent literature on 3D face reconstruction from a single image. Most articles have been choosen among 2016 and 2020, in order to provide the most up-to-date view of the single image 3D face reconstruction.

[85]  arXiv:2110.09348 [pdf, other]
Title: Understanding Dimensional Collapse in Contrastive Self-supervised Learning
Comments: 15 pages, 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Self-supervised visual representation learning aims to learn useful representations without relying on human annotations. Joint embedding approach bases on maximizing the agreement between embedding vectors from different views of the same image. Various methods have been proposed to solve the collapsing problem where all embedding vectors collapse to a trivial constant solution. Among these methods, contrastive learning prevents collapse via negative sample pairs. It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space. Here, we show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that leads to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector. Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.

[86]  arXiv:2110.09355 [pdf, other]
Title: FAST3D: Flow-Aware Self-Training for 3D Object Detectors
Comments: Accepted to BMVC 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

In the field of autonomous driving, self-training is widely applied to mitigate distribution shifts in LiDAR-based 3D object detectors. This eliminates the need for expensive, high-quality labels whenever the environment changes (e.g., geographic location, sensor setup, weather condition). State-of-the-art self-training approaches, however, mostly ignore the temporal nature of autonomous driving data. To address this issue, we propose a flow-aware self-training method that enables unsupervised domain adaptation for 3D object detectors on continuous LiDAR point clouds. In order to get reliable pseudo-labels, we leverage scene flow to propagate detections through time. In particular, we introduce a flow-based multi-target tracker, that exploits flow consistency to filter and refine resulting tracks. The emerged precise pseudo-labels then serve as a basis for model re-training. Starting with a pre-trained KITTI model, we conduct experiments on the challenging Waymo Open Dataset to demonstrate the effectiveness of our approach. Without any prior target domain knowledge, our results show a significant improvement over the state-of-the-art.

[87]  arXiv:2110.09374 [pdf, other]
Title: Ortho-Shot: Low Displacement Rank Regularization with Data Augmentation for Few-Shot Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

In few-shot classification, the primary goal is to learn representations from a few samples that generalize well for novel classes. In this paper, we propose an efficient low displacement rank (LDR) regularization strategy termed Ortho-Shot; a technique that imposes orthogonal regularization on the convolutional layers of a few-shot classifier, which is based on the doubly-block toeplitz (DBT) matrix structure. The regularized convolutional layers of the few-shot classifier enhances model generalization and intra-class feature embeddings that are crucial for few-shot learning. Overfitting is a typical issue for few-shot models, the lack of data diversity inhibits proper model inference which weakens the classification accuracy of few-shot learners to novel classes. In this regard, we broke down the pipeline of the few-shot classifier and established that the support, query and task data augmentation collectively alleviates overfitting in networks. With compelling results, we demonstrated that combining a DBT-based low-rank orthogonal regularizer with data augmentation strategies, significantly boosts the performance of a few-shot classifier. We perform our experiments on the miniImagenet, CIFAR-FS and Stanford datasets with performance values of about 5\% when compared to state-of-the-art

[88]  arXiv:2110.09380 [pdf, other]
Title: Learning multiplane images from single views with self-supervision
Comments: To appear on BMVC 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Generating static novel views from an already captured image is a hard task in computer vision and graphics, in particular when the single input image has dynamic parts such as persons or moving objects. In this paper, we tackle this problem by proposing a new framework, called CycleMPI, that is capable of learning a multiplane image representation from single images through a cyclic training strategy for self-supervision. Our framework does not require stereo data for training, therefore it can be trained with massive visual data from the Internet, resulting in a better generalization capability even for very challenging cases. Although our method does not require stereo data for supervision, it reaches results on stereo datasets comparable to the state of the art in a zero-shot scenario. We evaluated our method on RealEstate10K and Mannequin Challenge datasets for view synthesis and presented qualitative results on Places II dataset.

[89]  arXiv:2110.09396 [pdf, other]
Title: Streaming Machine Learning and Online Active Learning for Automated Visual Inspection
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Quality control is a key activity performed by manufacturing companies to verify product conformance to the requirements and specifications. Standardized quality control ensures that all the products are evaluated under the same criteria. The decreased cost of sensors and connectivity enabled an increasing digitalization of manufacturing and provided greater data availability. Such data availability has spurred the development of artificial intelligence models, which allow higher degrees of automation and reduced bias when inspecting the products. Furthermore, the increased speed of inspection reduces overall costs and time required for defect inspection. In this research, we compare five streaming machine learning algorithms applied to visual defect inspection with real-world data provided by Philips Consumer Lifestyle BV. Furthermore, we compare them in a streaming active learning context, which reduces the data labeling effort in a real-world context. Our results show that active learning reduces the data labeling effort by almost 15% on average for the worst case, while keeping an acceptable classification performance. The use of machine learning models for automated visual inspection are expected to speed up the quality inspection up to 40%.

[90]  arXiv:2110.09401 [pdf, other]
Title: Mesh Convolutional Autoencoder for Semi-Regular Meshes of Different Sizes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

The analysis of deforming 3D surface meshes is accelerated by autoencoders since the low-dimensional embeddings can be used to visualize underlying dynamics. But, state-of-the-art mesh convolutional autoencoders require a fixed connectivity of all input meshes handled by the autoencoder. This is due to either the use of spectral convolutional layers or mesh dependent pooling operations. Therefore, the types of datasets that one can study are limited and the learned knowledge cannot be transferred to other datasets that exhibit similar behavior. To address this, we transform the discretization of the surfaces to semi-regular meshes that have a locally regular connectivity and whose meshing is hierarchical. This allows us to apply the same spatial convolutional filters to the local neighborhoods and to define a pooling operator that can be applied to every semi-regular mesh. We apply the same mesh autoencoder to different datasets and our reconstruction error is more than 50% lower than the error from state-of-the-art models, which have to be trained for every mesh separately. Additionally, we visualize the underlying dynamics of unseen mesh sequences with an autoencoder trained on different classes of meshes.

[91]  arXiv:2110.09408 [pdf, other]
Title: HRFormer: High-Resolution Transformer for Dense Prediction
Comments: Accepted at NeurIPS 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We present a High-Resolution Transformer (HRT) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer that produces low-resolution representations and has high memory and computational cost. We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet), along with local-window self-attention that performs self-attention over small non-overlapping image windows, for improving the memory and computation efficiency. In addition, we introduce a convolution into the FFN to exchange information across the disconnected image windows. We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks, e.g., HRT outperforms Swin transformer by $1.3$ AP on COCO pose estimation with $50\%$ fewer parameters and $30\%$ fewer FLOPs. Code is available at: https://github.com/HRNet/HRFormer.

[92]  arXiv:2110.09415 [pdf, other]
Title: NeuralBlox: Real-Time Neural Representation Fusion for Robust Volumetric Mapping
Comments: 3DV 2021. Equal contribution between the first two authors. Code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

We present a novel 3D mapping method leveraging the recent progress in neural implicit representation for 3D reconstruction. Most existing state-of-the-art neural implicit representation methods are limited to object-level reconstructions and can not incrementally perform updates given new data. In this work, we propose a fusion strategy and training pipeline to incrementally build and update neural implicit representations that enable the reconstruction of large scenes from sequential partial observations. By representing an arbitrarily sized scene as a grid of latent codes and performing updates directly in latent space, we show that incrementally built occupancy maps can be obtained in real-time even on a CPU. Compared to traditional approaches such as Truncated Signed Distance Fields (TSDFs), our map representation is significantly more robust in yielding a better scene completeness given noisy inputs. We demonstrate the performance of our approach in thorough experimental validation on real-world datasets with varying degrees of added pose noise.

[93]  arXiv:2110.09424 [pdf, other]
Title: Don't Judge Me by My Face : An Indirect Adversarial Approach to Remove Sensitive Information From Multimodal Neural Representation in Asynchronous Job Video Interviews
Comments: published in ACII 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

se of machine learning for automatic analysis of job interview videos has recently seen increased interest. Despite claims of fair output regarding sensitive information such as gender or ethnicity of the candidates, the current approaches rarely provide proof of unbiased decision-making, or that sensitive information is not used. Recently, adversarial methods have been proved to effectively remove sensitive information from the latent representation of neural networks. However, these methods rely on the use of explicitly labeled protected variables (e.g. gender), which cannot be collected in the context of recruiting in some countries (e.g. France). In this article, we propose a new adversarial approach to remove sensitive information from the latent representation of neural networks without the need to collect any sensitive variable. Using only a few frames of the interview, we train our model to not be able to find the face of the candidate related to the job interview in the inner layers of the model. This, in turn, allows us to remove relevant private information from these layers. Comparing our approach to a standard baseline on a public dataset with gender and ethnicity annotations, we show that it effectively removes sensitive information from the main network. Moreover, to the best of our knowledge, this is the first application of adversarial techniques for obtaining a multimodal fair representation in the context of video job interviews. In summary, our contributions aim at improving fairness of the upcoming automatic systems processing videos of job interviews for equality in job selection.

[94]  arXiv:2110.09425 [pdf, other]
Title: FacialGAN: Style Transfer and Attribute Manipulation on Synthetic Faces
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Facial image manipulation is a generation task where the output face is shifted towards an intended target direction in terms of facial attribute and styles. Recent works have achieved great success in various editing techniques such as style transfer and attribute translation. However, current approaches are either focusing on pure style transfer, or on the translation of predefined sets of attributes with restricted interactivity. To address this issue, we propose FacialGAN, a novel framework enabling simultaneous rich style transfers and interactive facial attributes manipulation. While preserving the identity of a source image, we transfer the diverse styles of a target image to the source image. We then incorporate the geometry information of a segmentation mask to provide a fine-grained manipulation of facial attributes. Finally, a multi-objective learning strategy is introduced to optimize the loss of each specific tasks. Experiments on the CelebA-HQ dataset, with CelebAMask-HQ as semantic mask labels, show our model's capacity in producing visually compelling results in style transfer, attribute manipulation, diversity and face verification. For reproducibility, we provide an interactive open-source tool to perform facial manipulations, and the Pytorch implementation of the model.

[95]  arXiv:2110.09428 [pdf, other]
Title: Distinguishing Natural and Computer-Generated Images using Multi-Colorspace fused EfficientNet
Comments: 13 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

The problem of distinguishing natural images from photo-realistic computer-generated ones either addresses natural images versus computer graphics or natural images versus GAN images, at a time. But in a real-world image forensic scenario, it is highly essential to consider all categories of image generation, since in most cases image generation is unknown. We, for the first time, to our best knowledge, approach the problem of distinguishing natural images from photo-realistic computer-generated images as a three-class classification task classifying natural, computer graphics, and GAN images. For the task, we propose a Multi-Colorspace fused EfficientNet model by parallelly fusing three EfficientNet networks that follow transfer learning methodology where each network operates in different colorspaces, RGB, LCH, and HSV, chosen after analyzing the efficacy of various colorspace transformations in this image forensics problem. Our model outperforms the baselines in terms of accuracy, robustness towards post-processing, and generalizability towards other datasets. We conduct psychophysics experiments to understand how accurately humans can distinguish natural, computer graphics, and GAN images where we could observe that humans find difficulty in classifying these images, particularly the computer-generated images, indicating the necessity of computational algorithms for the task. We also analyze the behavior of our model through visual explanations to understand salient regions that contribute to the model's decision making and compare with manual explanations provided by human participants in the form of region markings, where we could observe similarities in both the explanations indicating the powerful nature of our model to take the decisions meaningfully.

[96]  arXiv:2110.09455 [pdf, other]
Title: TLDR: Twin Learning for Dimensionality Reduction
Comments: Code available at: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Dimensionality reduction methods are unsupervised approaches which learn low-dimensional spaces where some properties of the initial space, typically the notion of "neighborhood", are preserved. They are a crucial component of diverse tasks like visualization, compression, indexing, and retrieval. Aiming for a totally different goal, self-supervised visual representation learning has been shown to produce transferable representation functions by learning models that encode invariance to artificially created distortions, e.g. a set of hand-crafted image transformations. Unlike manifold learning methods that usually require propagation on large k-NN graphs or complicated optimization solvers, self-supervised learning approaches rely on simpler and more scalable frameworks for learning. In this paper, we unify these two families of approaches from the angle of manifold learning and propose TLDR, a dimensionality reduction method for generic input spaces that is porting the simple self-supervised learning framework of Barlow Twins to a setting where it is hard or impossible to define an appropriate set of distortions by hand. We propose to use nearest neighbors to build pairs from a training set and a redundancy reduction loss borrowed from the self-supervised literature to learn an encoder that produces representations invariant across such pairs. TLDR is a method that is simple, easy to implement and train, and of broad applicability; it consists of an offline nearest neighbor computation step that can be highly approximated, and a straightforward learning process that does not require mining negative samples to contrast, eigendecompositions, or cumbersome optimization solvers. By replacing PCA with TLDR, we are able to increase the performance of GeM-AP by 4% mAP for 128 dimensions, and to retain its performance with 16x fewer dimensions.

[97]  arXiv:2110.09470 [pdf, other]
Title: No RL, No Simulation: Learning to Navigate without Navigating
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Most prior methods for learning navigation policies require access to simulation environments, as they need online policy interaction and rely on ground-truth maps for rewards. However, building simulators is expensive (requires manual effort for each and every scene) and creates challenges in transferring learned policies to robotic platforms in the real-world, due to the sim-to-real domain gap. In this paper, we pose a simple question: Do we really need active interaction, ground-truth maps or even reinforcement-learning (RL) in order to solve the image-goal navigation task? We propose a self-supervised approach to learn to navigate from only passive videos of roaming. Our approach, No RL, No Simulator (NRNS), is simple and scalable, yet highly effective. NRNS outperforms RL-based formulations by a significant margin. We present NRNS as a strong baseline for any future image-based navigation tasks that use RL or Simulation.

[98]  arXiv:2110.09481 [pdf, other]
Title: MTP: Multi-Hypothesis Tracking and Prediction for Reduced Error Propagation
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA); Robotics (cs.RO)

Recently, there has been tremendous progress in developing each individual module of the standard perception-planning robot autonomy pipeline, including detection, tracking, prediction of other agents' trajectories, and ego-agent trajectory planning. Nevertheless, there has been less attention given to the principled integration of these components, particularly in terms of the characterization and mitigation of cascading errors. This paper addresses the problem of cascading errors by focusing on the coupling between the tracking and prediction modules. First, by using state-of-the-art tracking and prediction tools, we conduct a comprehensive experimental evaluation of how severely errors stemming from tracking can impact prediction performance. On the KITTI and nuScenes datasets, we find that predictions consuming tracked trajectories as inputs (the typical case in practice) can experience a significant (even order of magnitude) drop in performance in comparison to the idealized setting where ground truth past trajectories are used as inputs. To address this issue, we propose a multi-hypothesis tracking and prediction framework. Rather than relying on a single set of tracking results for prediction, our framework simultaneously reasons about multiple sets of tracking results, thereby increasing the likelihood of including accurate tracking results as inputs to prediction. We show that this framework improves overall prediction performance over the standard single-hypothesis tracking-prediction pipeline by up to 34.2% on the nuScenes dataset, with even more significant improvements (up to ~70%) when restricting the evaluation to challenging scenarios involving identity switches and fragments -- all with an acceptable computation overhead.

[99]  arXiv:2110.09482 [pdf, other]
Title: Self-Supervised Monocular DepthEstimation with Internal Feature Fusion
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Self-supervised learning for depth estimation uses geometry in image sequences for supervision and shows promising results. Like many computer vision tasks, depth network performance is determined by the capability to learn accurate spatial and semantic representations from images. Therefore, it is natural to exploit semantic segmentation networks for depth estimation. In this work, based on a well-developed semantic segmentation network HRNet, we propose a novel depth estimation networkDIFFNet, which can make use of semantic information in down and upsampling procedures. By applying feature fusion and an attention mechanism, our proposed method outperforms the state-of-the-art monocular depth estimation methods on the KITTI benchmark. Our method also demonstrates greater potential on higher resolution training data. We propose an additional extended evaluation strategy by establishing a test set of challenging cases, empirically derived from the standard benchmark.

[100]  arXiv:2110.09490 [pdf, other]
Title: Unsupervised Image Fusion Using Deep Image Priors
Subjects: Computer Vision and Pattern Recognition (cs.CV)

A significant number of researchers have recently applied deep learning methods to image fusion. However, most of these works either require a large amount of training data or depend on pre-trained models or frameworks. This inevitably encounters a shortage of training data or a mismatch between the framework and the actual problem. Recently, the publication of Deep Image Prior (DIP) method made it possible to do image restoration totally training-data-free. However, the original design of DIP is hard to be generalized to multi-image processing problems. This paper introduces a novel loss calculation structure, in the framework of DIP, while formulating image fusion as an inverse problem. This enables the extension of DIP to general multisensor/multifocus image fusion problems. Secondly, we propose a multi-channel approach to improve the effect of DIP. Finally, an evaluation is conducted using several commonly used image fusion assessment metrics. The results are compared with state-of-the-art traditional and deep learning image fusion methods. Our method outperforms previous techniques for a range of metrics. In particular, it is shown to provide the best objective results for most metrics when applied to medical images.

[101]  arXiv:2110.09508 [pdf, other]
Title: Deep CNNs for Peripheral Blood Cell Classification
Comments: 20 pages, 14 figures, Submitted at MIDL 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The application of machine learning techniques to the medical domain is especially challenging due to the required level of precision and the incurrence of huge risks of minute errors. Employing these techniques to a more complex subdomain of hematological diagnosis seems quite promising, with automatic identification of blood cell types, which can help in detection of hematologic disorders. In this paper, we benchmark 27 popular deep convolutional neural network architectures on the microscopic peripheral blood cell images dataset. The dataset is publicly available, with large number of normal peripheral blood cells acquired using the CellaVision DM96 analyzer and identified by expert pathologists into eight different cell types. We fine-tune the state-of-the-art image classification models pre-trained on the ImageNet dataset for blood cell classification. We exploit data augmentation techniques during training to avoid overfitting and achieve generalization. An ensemble of the top performing models obtains significant improvements over past published works, achieving the state-of-the-art results with a classification accuracy of 99.51%. Our work provides empirical baselines and benchmarks on standard deep-learning architectures for microscopic peripheral blood cell recognition task.

[102]  arXiv:2110.09510 [pdf, other]
Title: Unsupervised Finetuning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

This paper studies "unsupervised finetuning", the symmetrical problem of the well-known "supervised finetuning". Given a pretrained model and small-scale unlabeled target data, unsupervised finetuning is to adapt the representation pretrained from the source domain to the target domain so that better transfer performance can be obtained. This problem is more challenging than the supervised counterpart, as the low data density in the small-scale target data is not friendly for unsupervised learning, leading to the damage of the pretrained representation and poor representation in the target domain. In this paper, we find the source data is crucial when shifting the finetuning paradigm from supervise to unsupervise, and propose two simple and effective strategies to combine source and target data into unsupervised finetuning: "sparse source data replaying", and "data mixing". The motivation of the former strategy is to add a small portion of source data back to occupy their pretrained representation space and help push the target data to reside in a smaller compact space; and the motivation of the latter strategy is to increase the data density and help learn more compact representation. To demonstrate the effectiveness of our proposed ``unsupervised finetuning'' strategy, we conduct extensive experiments on multiple different target datasets, which show better transfer performance than the naive strategy.

Cross-lists for Tue, 19 Oct 21

[103]  arXiv:2102.10663 (cross-list from eess.IV) [pdf, other]
Title: MedAug: Contrastive learning leveraging patient metadata improves representations for chest X-ray interpretation
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Self-supervised contrastive learning between pairs of multiple views of the same image has been shown to successfully leverage unlabeled data to produce meaningful visual representations for both natural and medical images. However, there has been limited work on determining how to select pairs for medical images, where availability of patient metadata can be leveraged to improve representations. In this work, we develop a method to select positive pairs coming from views of possibly different images through the use of patient metadata. We compare strategies for selecting positive pairs for chest X-ray interpretation including requiring them to be from the same patient, imaging study or laterality. We evaluate downstream task performance by fine-tuning the linear layer on 1% of the labeled dataset for pleural effusion classification. Our best performing positive pair selection strategy, which involves using images from the same patient from the same study across all lateralities, achieves a performance increase of 14.4% in mean AUC from the ImageNet pretrained baseline. Our controlled experiments show that the keys to improving downstream performance on disease classification are (1) using patient metadata to appropriately create positive pairs from different images with the same underlying pathologies, and (2) maximizing the number of different images used in query pairing. In addition, we explore leveraging patient metadata to select hard negative pairs for contrastive learning, but do not find improvement over baselines that do not use metadata. Our method is broadly applicable to medical image interpretation and allows flexibility for incorporating medical insights in choosing pairs for contrastive learning.

[104]  arXiv:2110.08263 (cross-list from cs.LG) [pdf, other]
Title: FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling
Comments: Accepted by NeurIPS 2021; 16 pages with appendix; code: this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

The recently proposed FixMatch achieved state-of-the-art results on most semi-supervised learning (SSL) benchmarks. However, like other modern SSL algorithms, FixMatch uses a pre-defined constant threshold for all classes to select unlabeled data that contribute to the training, thus failing to consider different learning status and learning difficulties of different classes. To address this issue, we propose Curriculum Pseudo Labeling (CPL), a curriculum learning approach to leverage unlabeled data according to the model's learning status. The core of CPL is to flexibly adjust thresholds for different classes at each time step to let pass informative unlabeled data and their pseudo labels. CPL does not introduce additional parameters or computations (forward or backward propagation). We apply CPL to FixMatch and call our improved algorithm FlexMatch. FlexMatch achieves state-of-the-art performance on a variety of SSL benchmarks, with especially strong performances when the labeled data are extremely limited or when the task is challenging. For example, FlexMatch outperforms FixMatch by 14.32% and 24.55% on CIFAR-100 and STL-10 datasets respectively, when there are only 4 labels per class. CPL also significantly boosts the convergence speed, e.g., FlexMatch can use only 1/5 training time of FixMatch to achieve even better performance. Furthermore, we show that CPL can be easily adapted to other SSL algorithms and remarkably improve their performances. We open source our code at https://github.com/TorchSSL/TorchSSL.

[105]  arXiv:2110.08271 (cross-list from cs.LG) [pdf, other]
Title: Training Deep Neural Networks with Joint Quantization and Pruning of Weights and Activations
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Quantization and pruning are core techniques used to reduce the inference costs of deep neural networks. State-of-the-art quantization techniques are currently applied to both the weights and activations; however, pruning is most often applied to only the weights of the network. In this work, we jointly apply novel uniform quantization and unstructured pruning methods to both the weights and activations of deep neural networks during training. Using our methods, we empirically evaluate the currently accepted prune-then-quantize paradigm across a wide range of computer vision tasks and observe a non-commutative nature when applied to both the weights and activations of deep neural networks. Informed by these observations, we articulate the non-commutativity hypothesis: for a given deep neural network being trained for a specific task, there exists an exact training schedule in which quantization and pruning can be introduced to optimize network performance. We identify that this optimal ordering not only exists, but also varies across discriminative and generative tasks. Using the optimal training schedule within our training framework, we demonstrate increased performance per memory footprint over existing solutions.

[106]  arXiv:2110.08377 (cross-list from cs.RO) [pdf, other]
Title: Starkit: RoboCup Humanoid KidSize 2021 Worldwide Champion Team Paper
Comments: 15 pages, 10 figures
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

This article is devoted to the features that were under development between RoboCup 2019 Sydney and RoboCup 2021 Worldwide. These features include vision-related matters, such as detection and localization, mechanical and algorithmic novelties. Since the competition was held virtually, the simulation-specific features are also considered in the article. We give an overview of the approaches that were tried out along with the analysis of their preconditions, perspectives and the evaluation of their performance.

[107]  arXiv:2110.08407 (cross-list from eess.IV) [pdf, other]
Title: Bridging the gap between paired and unpaired medical image translation
Comments: Deep Generative Models for MICCAI (DGM4MICCAI) workshop 2021
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Medical image translation has the potential to reduce the imaging workload, by removing the need to capture some sequences, and to reduce the annotation burden for developing machine learning methods. GANs have been used successfully to translate images from one domain to another, such as MR to CT. At present, paired data (registered MR and CT images) or extra supervision (e.g. segmentation masks) is needed to learn good translation models. Registering multiple modalities or annotating structures within each of them is a tedious and laborious task. Thus, there is a need to develop improved translation methods for unpaired data. Here, we introduce modified pix2pix models for tasks CT$\rightarrow$MR and MR$\rightarrow$CT, trained with unpaired CT and MR data, and MRCAT pairs generated from the MR scans. The proposed modifications utilize the paired MR and MRCAT images to ensure good alignment between input and translated images, and unpaired CT images ensure the MR$\rightarrow$CT model produces realistic-looking CT and CT$\rightarrow$MR model works well with real CT as input. The proposed pix2pix variants outperform baseline pix2pix, pix2pixHD and CycleGAN in terms of FID and KID, and generate more realistic looking CT and MR translations.

[108]  arXiv:2110.08424 (cross-list from eess.IV) [pdf]
Title: Deep learning-based detection of intravenous contrast in computed tomography scans
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Purpose: Identifying intravenous (IV) contrast use within CT scans is a key component of data curation for model development and testing. Currently, IV contrast is poorly documented in imaging metadata and necessitates manual correction and annotation by clinician experts, presenting a major barrier to imaging analyses and algorithm deployment. We sought to develop and validate a convolutional neural network (CNN)-based deep learning (DL) platform to identify IV contrast within CT scans. Methods: For model development and evaluation, we used independent datasets of CT scans of head, neck (HN) and lung cancer patients, totaling 133,480 axial 2D scan slices from 1,979 CT scans manually annotated for contrast presence by clinical experts. Five different DL models were adopted and trained in HN training datasets for slice-level contrast detection. Model performances were evaluated on a hold-out set and on an independent validation set from another institution. DL models was then fine-tuned on chest CT data and externally validated on a separate chest CT dataset. Results: Initial DICOM metadata tags for IV contrast were missing or erroneous in 1,496 scans (75.6%). The EfficientNetB4-based model showed the best overall detection performance. For HN scans, AUC was 0.996 in the internal validation set (n = 216) and 1.0 in the external validation set (n = 595). The fine-tuned model on chest CTs yielded an AUC: 1.0 for the internal validation set (n = 53), and AUC: 0.980 for the external validation set (n = 402). Conclusion: The DL model could accurately detect IV contrast in both HN and chest CT scans with near-perfect performance.

[109]  arXiv:2110.08427 (cross-list from eess.IV) [pdf, other]
Title: COVID-19 Detection in Chest X-ray Images Using Swin-Transformer and Transformer in Transformer
Comments: COVID-19, Chest X-ray Images, Swin-Transformer, Transformer in Transformer, Model Ensembling, Image Classification
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

The Coronavirus Disease 2019 (COVID-19) has spread globally and caused serious damages. Chest X-ray images are widely used for COVID-19 diagnosis and Artificial Intelligence method can assist to increase the efficiency and accuracy. In the Challenge of Chest XR COVID-19 detection in Ethics and Explainability for Responsible Data Science (EE-RDS) conference 2021, we proposed a method which combined Swin Transformer and Transformer in Transformer to classify chest X-ray images as three classes: COVID-19, Pneumonia and Normal (healthy) and achieved 0.9475 accuracy on test set.

[110]  arXiv:2110.08446 (cross-list from cs.AI) [pdf, other]
Title: Self-Annotated Training for Controllable Image Captioning
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

The Controllable Image Captioning (CIC) task aims to generate captions conditioned on designated control signals. In this paper, we improve CIC from two aspects: 1) Existing reinforcement training methods are not applicable to structure-related CIC models due to the fact that the accuracy-based reward focuses mainly on contents rather than semantic structures. The lack of reinforcement training prevents the model from generating more accurate and controllable sentences. To solve the problem above, we propose a novel reinforcement training method for structure-related CIC models: Self-Annotated Training (SAT), where a recursive sampling mechanism (RSM) is designed to force the input control signal to match the actual output sentence. Extensive experiments conducted on MSCOCO show that our SAT method improves C-Transformer (XE) on CIDEr-D score from 118.6 to 130.1 in the length-control task and from 132.2 to 142.7 in the tense-control task, while maintaining more than 99$\%$ matching accuracy with the control signal. 2) We introduce a new control signal: sentence quality. Equipped with it, CIC models are able to generate captions of different quality levels as needed. Experiments show that without additional information of ground truth captions, models controlled by the highest level of sentence quality perform much better in accuracy than baseline models.

[111]  arXiv:2110.08486 (cross-list from cs.CL) [pdf, other]
Title: Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

The ability to sequence unordered events is an essential skill to comprehend and reason about real world task procedures, which often requires thorough understanding of temporal common sense and multimodal information, as these procedures are often communicated through a combination of texts and images. Such capability is essential for applications such as sequential task planning and multi-source instruction summarization. While humans are capable of reasoning about and sequencing unordered multimodal procedural instructions, whether current machine learning models have such essential capability is still an open question. In this work, we benchmark models' capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from popular online instructional manuals and collecting comprehensive human annotations. We find models not only perform significantly worse than humans but also seem incapable of efficiently utilizing the multimodal information. To improve machines' performance on multimodal event sequencing, we propose sequentiality-aware pretraining techniques that exploit the sequential alignment properties of both texts and images, resulting in > 5% significant improvements.

[112]  arXiv:2110.08509 (cross-list from eess.IV) [pdf, other]
Title: BAPGAN: GAN-based Bone Age Progression of Femur and Phalange X-ray Images
Comments: 6 pages, 5 figures, accepted to SPIE Medical Imaging 2022
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Convolutional Neural Networks play a key role in bone age assessment for investigating endocrinology, genetic, and growth disorders under various modalities and body regions. However, no researcher has tackled bone age progression/regression despite its valuable potential applications: bone-related disease diagnosis, clinical knowledge acquisition, and museum education. Therefore, we propose Bone Age Progression Generative Adversarial Network (BAPGAN) to progress/regress both femur/phalange X-ray images while preserving identity and realism. We exhaustively confirm the BAPGAN's clinical potential via Frechet Inception Distance, Visual Turing Test by two expert orthopedists, and t-Distributed Stochastic Neighbor Embedding.

[113]  arXiv:2110.08515 (cross-list from cs.CL) [pdf, other]
Title: Multimodal Dialogue Response Generation
Comments: This paper has been submitted before 15th October @ 11:59pm AOE(UTC -12)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

Responsing with image has been recognized as an important capability for an intelligent conversational agent. Yet existing works only focus on exploring the multimodal dialogue models which depend on retrieval-based methods, but neglecting generation methods. To fill in the gaps, we first present a multimodal dialogue generation model, which takes the dialogue history as input, then generates a textual sequence or an image as response. Learning such a model often requires multimodal dialogues containing both texts and images which are difficult to obtain. Motivated by the challenge in practice, we consider multimodal dialogue generation under a natural assumption that only limited training examples are available. In such a low-resource setting, we devise a novel conversational agent, Divter, in order to isolate parameters that depend on multimodal dialogues from the entire generation model. By this means, the major part of the model can be learned from a large number of text-only dialogues and text-image pairs respectively, then the whole parameters can be well fitted using the limited training examples. Extensive experiments demonstrate our method achieves state-of-the-art results in both automatic and human evaluation, and can generate informative text and high-resolution image responses.

[114]  arXiv:2110.08521 (cross-list from eess.IV) [pdf, other]
Title: Locally Adaptive Structure and Texture Similarity for Image Quality Assessment
Journal-ref: Proceedings of the 29th ACM International Conference on Multimedia, 2021
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

The latest advances in full-reference image quality assessment (IQA) involve unifying structure and texture similarity based on deep representations. The resulting Deep Image Structure and Texture Similarity (DISTS) metric, however, makes rather global quality measurements, ignoring the fact that natural photographic images are locally structured and textured across space and scale. In this paper, we describe a locally adaptive structure and texture similarity index for full-reference IQA, which we term A-DISTS. Specifically, we rely on a single statistical feature, namely the dispersion index, to localize texture regions at different scales. The estimated probability (of one patch being texture) is in turn used to adaptively pool local structure and texture measurements. The resulting A-DISTS is adapted to local image content, and is free of expensive human perceptual scores for supervised training. We demonstrate the advantages of A-DISTS in terms of correlation with human data on ten IQA databases and optimization of single image super-resolution methods.

[115]  arXiv:2110.08569 (cross-list from eess.IV) [pdf, other]
Title: Deep Image Debanding
Comments: 5 pages, 4 figures, 5 tables
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Banding or false contour is an annoying visual artifact whose impact is even more pronounced in ultra high definition, high dynamic range, and wide colour gamut visual content, which is becoming increasingly popular. Since users associate a heightened expectation of quality with such content and banding leads to deteriorated visual quality-of-experience, the area of banding removal or debanding has taken paramount importance. Existing debanding approaches are mostly knowledge-driven. Despite the widespread success of deep learning in other areas of image processing and computer vision, data-driven debanding approaches remain surprisingly missing. In this work, we make one of the first attempts to develop a deep learning based banding artifact removal method for images and name it deep debanding network (deepDeband). For its training, we construct a large-scale dataset of 51,490 pairs of corresponding pristine and banded image patches. Performance evaluation shows that deepDeband is successful at greatly reducing banding artifacts in images, outperforming existing methods both quantitatively and visually.

[116]  arXiv:2110.08595 (cross-list from eess.SP) [pdf, other]
Title: A MIMO Radar-based Few-Shot Learning Approach for Human-ID
Comments: 5 pages, 6 figures, 2 tables
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)

Radar for deep learning-based human identification has become a research area of increasing interest. It has been shown that micro-Doppler (\(\upmu\)-D) can reflect the walking behavior through capturing the periodic limbs' micro-motions. One of the main aspects is maximizing the number of included classes while considering the real-time and training dataset size constraints. In this paper, a multiple-input-multiple-output (MIMO) radar is used to formulate micro-motion spectrograms of the elevation angular velocity (\(\upmu\)-\(\omega\)). The effectiveness of concatenating this newly-formulated spectrogram with the commonly used \(\upmu\)-D is investigated. To accommodate for non-constrained real walking motion, an adaptive cycle segmentation framework is utilized and a metric learning network is trained on half gait cycles (\(\approx\) 0.5 s). Studies on the effects of various numbers of classes (5--20), different dataset sizes, and varying observation time windows 1--2 s are conducted. A non-constrained walking dataset of 22 subjects is collected with different aspect angles with respect to the radar. The proposed few-shot learning (FSL) approach achieves a classification error of 11.3 % with only 2 min of training data per subject.

[117]  arXiv:2110.08599 (cross-list from cs.LG) [pdf, other]
Title: Mapping illegal waste dumping sites with neural-network classification of satellite imagery
Comments: 5 pages, 3 figures, KDD Workshop on Data-driven Humanitarian Mapping held with the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 14, 2021
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Public health and habitat quality are crucial goals of urban planning. In recent years, the severe social and environmental impact of illegal waste dumping sites has made them one of the most serious problems faced by cities in the Global South, in a context of scarce information available for decision making. To help identify the location of dumping sites and track their evolution over time we adopt a data-driven model from the machine learning domain, analyzing satellite images. This allows us to take advantage of the increasing availability of geo-spatial open-data, high-resolution satellite imagery, and open source tools to train machine learning algorithms with a small set of known waste dumping sites in Buenos Aires, and then predict the location of other sites over vast areas at high speed and low cost. This case study shows the results of a collaboration between Dymaxion Labs and Fundaci\'on Bunge y Born to harness this technique in order to create a comprehensive map of potential locations of illegal waste dumping sites in the region.

[118]  arXiv:2110.08610 (cross-list from cs.HC) [pdf, other]
Title: MAAD: A Model and Dataset for "Attended Awareness" in Driving
Comments: 25 pages, 13 figures, 14 tables, Accepted at EPIC@ICCV 2021 Workshop. Main paper + Supplementary Material
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

We propose a computational model to estimate a person's attended awareness of their environment. We define attended awareness to be those parts of a potentially dynamic scene which a person has attended to in recent history and which they are still likely to be physically aware of. Our model takes as input scene information in the form of a video and noisy gaze estimates, and outputs visual saliency, a refined gaze estimate, and an estimate of the person's attended awareness. In order to test our model, we capture a new dataset with a high-precision gaze tracker including 24.5 hours of gaze sequences from 23 subjects attending to videos of driving scenes. The dataset also contains third-party annotations of the subjects' attended awareness based on observations of their scan path. Our results show that our model is able to reasonably estimate attended awareness in a controlled setting, and in the future could potentially be extended to real egocentric driving data to help enable more effective ahead-of-time warnings in safety systems and thereby augment driver performance. We also demonstrate our model's effectiveness on the tasks of saliency, gaze calibration, and denoising, using both our dataset and an existing saliency dataset. We make our model and dataset available at https://github.com/ToyotaResearchInstitute/att-aware/.

[119]  arXiv:2110.08619 (cross-list from eess.IV) [pdf, other]
Title: SAGAN: Adversarial Spatial-asymmetric Attention for Noisy Nona-Bayer Reconstruction
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Nona-Bayer colour filter array (CFA) pattern is considered one of the most viable alternatives to traditional Bayer patterns. Despite the substantial advantages, such non-Bayer CFA patterns are susceptible to produce visual artefacts while reconstructing RGB images from noisy sensor data. This study addresses the challenges of learning RGB image reconstruction from noisy Nona-Bayer CFA comprehensively. We propose a novel spatial-asymmetric attention module to jointly learn bi-direction transformation and large-kernel global attention to reduce the visual artefacts. We combine our proposed module with adversarial learning to produce plausible images from Nona-Bayer CFA. The feasibility of the proposed method has been verified and compared with the state-of-the-art image reconstruction method. The experiments reveal that the proposed method can reconstruct RGB images from noisy Nona-Bayer CFA without producing any visually disturbing artefacts. Also, it can outperform the state-of-the-art image reconstruction method in both qualitative and quantitative comparison. Code available: https://github.com/sharif-apu/SAGAN_BMVC21.

[120]  arXiv:2110.08721 (cross-list from eess.IV) [pdf, other]
Title: CAE-Transformer: Transformer-based Model to Predict Invasiveness of Lung Adenocarcinoma Subsolid Nodules from Non-thin Section 3D CT Scans
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Lung cancer is the leading cause of mortality from cancer worldwide and has various histologic types, among which Lung Adenocarcinoma (LAUC) has recently been the most prevalent. Lung adenocarcinomas are classified as pre-invasive, minimally invasive, and invasive adenocarcinomas. Timely and accurate knowledge of the invasiveness of lung nodules leads to a proper treatment plan and reduces the risk of unnecessary or late surgeries. Currently, the primary imaging modality to assess and predict the invasiveness of LAUCs is the chest CT. The results based on CT images, however, are subjective and suffer from a low accuracy compared to the ground truth pathological reviews provided after surgical resections. In this paper, a predictive transformer-based framework, referred to as the "CAE-Transformer", is developed to classify LAUCs. The CAE-Transformer utilizes a Convolutional Auto-Encoder (CAE) to automatically extract informative features from CT slices, which are then fed to a modified transformer model to capture global inter-slice relations. Experimental results on our in-house dataset of 114 pathologically proven Sub-Solid Nodules (SSNs) demonstrate the superiority of the CAE-Transformer over the histogram/radiomics-based models and its deep learning-based counterparts, achieving an accuracy of 87.73%, sensitivity of 88.67%, specificity of 86.33%, and AUC of 0.913, using a 10-fold cross-validation.

[121]  arXiv:2110.08726 (cross-list from eess.IV) [pdf, other]
Title: Data Shapley Value for Handling Noisy Labels: An application in Screening COVID-19 Pneumonia from Chest CT Scans
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

A long-standing challenge of deep learning models involves how to handle noisy labels, especially in applications where human lives are at stake. Adoption of the data Shapley Value (SV), a cooperative game theoretical approach, is an intelligent valuation solution to tackle the issue of noisy labels. Data SV can be used together with a learning model and an evaluation metric to validate each training point's contribution to the model's performance. The SV of a data point, however, is not unique and depends on the learning model, the evaluation metric, and other data points collaborating in the training game. However, effects of utilizing different evaluation metrics for computation of the SV, detecting the noisy labels, and measuring the data points' importance has not yet been thoroughly investigated. In this context, we performed a series of comparative analyses to assess SV's capabilities to detect noisy input labels when measured by different evaluation metrics. Our experiments on COVID-19-infected of CT images illustrate that although the data SV can effectively identify noisy labels, adoption of different evaluation metric can significantly influence its ability to identify noisy labels from different data classes. Specifically, we demonstrate that the SV greatly depends on the associated evaluation metric.

[122]  arXiv:2110.08776 (cross-list from eess.IV) [pdf, other]
Title: Self-Supervised U-Net for Segmenting Flat and Sessile Polyps
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Colorectal Cancer(CRC) poses a great risk to public health. It is the third most common cause of cancer in the US. Development of colorectal polyps is one of the earliest signs of cancer. Early detection and resection of polyps can greatly increase survival rate to 90%. Manual inspection can cause misdetections because polyps vary in color, shape, size and appearance. To this end, Computer-Aided Diagnosis systems(CADx) has been proposed that detect polyps by processing the colonoscopic videos. The system acts a secondary check to help clinicians reduce misdetections so that polyps may be resected before they transform to cancer. Polyps vary in color, shape, size, texture and appearance. As a result, the miss rate of polyps is between 6% and 27% despite the prominence of CADx solutions. Furthermore, sessile and flat polyps which have diameter less than 10 mm are more likely to be undetected. Convolutional Neural Networks(CNN) have shown promising results in polyp segmentation. However, all of these works have a supervised approach and are limited by the size of the dataset. It was observed that smaller datasets reduce the segmentation accuracy of ResUNet++. We train a U-Net to inpaint randomly dropped out pixels in the image as a proxy task. The dataset we use for pre-training is Kvasir-SEG dataset. This is followed by a supervised training on the limited Kvasir-Sessile dataset. Our experimental results demonstrate that with limited annotated dataset and a larger unlabeled dataset, self-supervised approach is a better alternative than fully supervised approach. Specifically, our self-supervised U-Net performs better than five segmentation models which were trained in supervised manner on the Kvasir-Sessile dataset.

[123]  arXiv:2110.08811 (cross-list from eess.IV) [pdf, other]
Title: Attention W-Net: Improved Skip Connections for better Representations
Comments: Under review at ICASSP 2022, Singapore
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Segmentation of macro and microvascular structures in fundoscopic retinal images plays a crucial role in detection of multiple retinal and systemic diseases, yet it is a difficult problem to solve. Most deep learning approaches for this task involve an autoencoder based architecture, but they face several issues such as lack of enough parameters, overfitting when there are enough parameters and incompatibility between internal feature-spaces. Due to such issues, these techniques are hence not able to extract the best semantic information from the limited data present for such tasks. We propose Attention W-Net, a new U-Net based architecture for retinal vessel segmentation to address these problems. In this architecture with a LadderNet backbone, we have two main contributions: Attention Block and regularisation measures. Our Attention Block uses decoder features to attend over the encoder features from skip-connections during upsampling, resulting in higher compatibility when the encoder and decoder features are added. Our regularisation measures include image augmentation and modifications to the ResNet Block used, which prevent overfitting. With these additions, we observe an AUC and F1-Score of 0.8407 and 0.9833 - a sizeable improvement over its LadderNet backbone as well as competitive performance among the contemporary state-of-the-art methods.

[124]  arXiv:2110.08812 (cross-list from eess.IV) [pdf]
Title: Rheumatoid Arthritis: Automated Scoring of Radiographic Joint Damage
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Rheumatoid arthritis is an autoimmune disease that causes joint damage due to inflammation in the soft tissue lining the joints known as the synovium. It is vital to identify joint damage as soon as possible to provide necessary treatment early and prevent further damage to the bone structures. Radiographs are often used to assess the extent of the joint damage. Currently, the scoring of joint damage from the radiograph takes expertise, effort, and time. Joint damage associated with rheumatoid arthritis is also not quantitated in clinical practice and subjective descriptors are used. In this work, we describe a pipeline of deep learning models to automatically identify and score rheumatoid arthritic joint damage from a radiographic image. Our automatic tool was shown to produce scores with extremely high balanced accuracy within a couple of minutes and utilizing this would remove the subjectivity of the scores between human reviewers.

[125]  arXiv:2110.08817 (cross-list from eess.IV) [pdf]
Title: A deep learning pipeline for localization, differentiation, and uncertainty estimation of liver lesions using multi-phasic and multi-sequence MRI
Comments: 18 pages, 6 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Objectives: to propose a fully-automatic computer-aided diagnosis (CAD) solution for liver lesion characterization, with uncertainty estimation.
Methods: we enrolled 400 patients who had either liver resection or a biopsy and was diagnosed with either hepatocellular carcinoma (HCC), intrahepatic cholangiocarcinoma, or secondary metastasis, from 2006 to 2019. Each patient was scanned with T1WI, T2WI, T1WI venous phase (T2WI-V), T1WI arterial phase (T1WI-A), and DWI MRI sequences. We propose a fully-automatic deep CAD pipeline that localizes lesions from 3D MRI studies using key-slice parsing and provides a confidence measure for its diagnoses. We evaluate using five-fold cross validation and compare performance against three radiologists, including a senior hepatology radiologist, a junior hepatology radiologist and an abdominal radiologist.
Results: the proposed CAD solution achieves a mean F1 score of 0.62, outperforming the abdominal radiologist (0.47), matching the junior hepatology radiologist (0.61), and underperforming the senior hepatology radiologist (0.68). The CAD system can informatively assess its diagnostic confidence, i.e., when only evaluating on the 70% most confident cases the mean f1 score and sensitivity at 80% specificity for HCC vs. others are boosted from 0.62 to 0.71 and 0.84 to 0.92, respectively.
Conclusion: the proposed fully-automatic CAD solution can provide good diagnostic performance with informative confidence assessments in finding and discriminating liver lesions from MRI studies.

[126]  arXiv:2110.08851 (cross-list from cs.LG) [pdf, other]
Title: Self-Supervised Learning for Binary Networks by Joint Classifier Training
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Despite the great success of self-supervised learning with large floating point networks, such networks are not readily deployable to edge devices. To accelerate deployment of models to edge devices for various downstream tasks by unsupervised representation learning, we propose a self-supervised learning method for binary networks. In particular, we propose to use a randomly initialized classifier attached to a pretrained floating point feature extractor as targets and jointly train it with a binary network. For better training of the binary network, we propose a feature similarity loss, a dynamic balancing scheme of loss terms, and modified multi-stage training. We call our method as BSSL. Our empirical validations show that BSSL outperforms self-supervised learning baselines for binary networks in various downstream tasks and outperforms supervised pretraining in certain tasks.

[127]  arXiv:2110.08858 (cross-list from cs.NE) [pdf, other]
Title: Backpropagation with Biologically Plausible Spatio-Temporal Adjustment For Training Deep Spiking Neural Networks
Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)

The spiking neural network (SNN) mimics the information processing operation in the human brain, represents and transmits information in spike trains containing wealthy spatial and temporal information, and shows superior performance on many cognitive tasks. In addition, the event-driven information processing enables the energy-efficient implementation on neuromorphic chips. The success of deep learning is inseparable from backpropagation. Due to the discrete information transmission, directly applying the backpropagation to the training of the SNN still has a performance gap compared with the traditional deep neural networks. Also, a large simulation time is required to achieve better performance, which results in high latency. To address the problems, we propose a biological plausible spatial adjustment, which rethinks the relationship between membrane potential and spikes and realizes a reasonable adjustment of gradients to different time steps. And it precisely controls the backpropagation of the error along the spatial dimension. Secondly, we propose a biologically plausible temporal adjustment making the error propagate across the spikes in the temporal dimension, which overcomes the problem of the temporal dependency within a single spike period of the traditional spiking neurons. We have verified our algorithm on several datasets, and the experimental results have shown that our algorithm greatly reduces the network latency and energy consumption while also improving network performance. We have achieved state-of-the-art performance on the neuromorphic datasets N-MNIST, DVS-Gesture, and DVS-CIFAR10. For the static datasets MNIST and CIFAR10, we have surpassed most of the traditional SNN backpropagation training algorithm and achieved relatively superior performance.

[128]  arXiv:2110.08977 (cross-list from cs.RO) [pdf, other]
Title: Accurate and Robust Object-oriented SLAM with 3D Quadric Landmark Construction in Outdoor Environment
Comments: Submitting to RA-L
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Object-oriented SLAM is a popular technology in autonomous driving and robotics. In this paper, we propose a stereo visual SLAM with a robust quadric landmark representation method. The system consists of four components, including deep learning detection, object-oriented data association, dual quadric landmark initialization and object-based pose optimization. State-of-the-art quadric-based SLAM algorithms always face observation related problems and are sensitive to observation noise, which limits their application in outdoor scenes. To solve this problem, we propose a quadric initialization method based on the decoupling of the quadric parameters method, which improves the robustness to observation noise. The sufficient object data association algorithm and object-oriented optimization with multiple cues enables a highly accurate object pose estimation that is robust to local observations. Experimental results show that the proposed system is more robust to observation noise and significantly outperforms current state-of-the-art methods in outdoor environments. In addition, the proposed system demonstrates real-time performance.

[129]  arXiv:2110.09022 (cross-list from cs.LG) [pdf, other]
Title: Demystifying How Self-Supervised Features Improve Training from Noisy Labels
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

The advancement of self-supervised learning (SSL) motivates researchers to apply SSL on other tasks such as learning with noisy labels. Recent literature indicates that methods built on SSL features can substantially improve the performance of learning with noisy labels. Nonetheless, the deeper reasons why (and how) SSL features benefit the training from noisy labels are less understood. In this paper, we study why and how self-supervised features help networks resist label noise using both theoretical analyses and numerical experiments. Our result shows that, given a quality encoder pre-trained from SSL, a simple linear layer trained by the cross-entropy loss is theoretically robust to symmetric label noise. Further, we provide insights for how knowledge distilled from SSL features can alleviate the over-fitting problem. We hope our work provides a better understanding for learning with noisy labels from the perspective of self-supervised learning and can potentially serve as a guideline for further research. Code is available at github.com/UCSC-REAL/SelfSup_NoisyLabel.

[130]  arXiv:2110.09023 (cross-list from cs.LG) [pdf, other]
Title: Utilizing Active Machine Learning for Quality Assurance: A Case Study of Virtual Car Renderings in the Automotive Industry
Comments: Hawaii International Conference on System Sciences 2022 (HICSS-55)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Computer-generated imagery of car models has become an indispensable part of car manufacturers' advertisement concepts. They are for instance used in car configurators to offer customers the possibility to configure their car online according to their personal preferences. However, human-led quality assurance faces the challenge to keep up with high-volume visual inspections due to the car models' increasing complexity. Even though the application of machine learning to many visual inspection tasks has demonstrated great success, its need for large labeled data sets remains a central barrier to using such systems in practice. In this paper, we propose an active machine learning-based quality assurance system that requires significantly fewer labeled instances to identify defective virtual car renderings without compromising performance. By employing our system at a German automotive manufacturer, start-up difficulties can be overcome, the inspection process efficiency can be increased, and thus economic advantages can be realized.

[131]  arXiv:2110.09113 (cross-list from eess.IV) [pdf]
Title: Salt and pepper noise removal method based on stationary Framelet transform with non-convex sparsity regularization
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Salt and pepper noise removal is a common inverse problem in image processing, and it aims to restore image information with high quality. Traditional salt and pepper denoising methods have two limitations. First, noise characteristics are often not described accurately. For example, the noise location information is often ignored and the sparsity of the salt and pepper noise is often described by L1 norm, which cannot illustrate the sparse variables clearly. Second, conventional methods separate the contaminated image into a recovered image and a noise part, thus resulting in recovering an image with unsatisfied smooth parts and detail parts. In this study, we introduce a noise detection strategy to determine the position of the noise, and a non-convex sparsity regularization depicted by Lp quasi-norm is employed to describe the sparsity of the noise, thereby addressing the first limitation. The morphological component analysis framework with stationary Framelet transform is adopted to decompose the processed image into cartoon, texture, and noise parts to resolve the second limitation. In this framework, the stationary Framelet regularizations with different parameters control the restoration of the cartoon and texture parts. In this way, the two parts are recovered separately to avoid mutual interference. Then, the alternating direction method of multipliers (ADMM) is employed to solve the proposed model. Finally, experiments are conducted to verify the proposed method and compare it with some current state-of-the-art denoising methods. The experimental results show that the proposed method can remove salt and pepper noise while preserving the details of the processed image.

[132]  arXiv:2110.09134 (cross-list from eess.IV) [pdf, other]
Title: GAN-based disentanglement learning for chest X-ray rib suppression
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Clinical evidence has shown that rib-suppressed chest X-rays (CXRs) can improve the reliability of pulmonary disease diagnosis. However, previous approaches on generating rib-suppressed CXR face challenges in preserving details and eliminating rib residues. We hereby propose a GAN-based disentanglement learning framework called Rib Suppression GAN, or RSGAN, to perform rib suppression by utilizing the anatomical knowledge embedded in unpaired computed tomography (CT) images. In this approach, we employ a residual map to characterize the intensity difference between CXR and the corresponding rib-suppressed result. To predict the residual map in CXR domain, we disentangle the image into structure- and contrast-specific features and transfer the rib structural priors from digitally reconstructed radiographs (DRRs) computed by CT. Furthermore, we employ additional adaptive loss to suppress rib residue and preserve more details. We conduct extensive experiments based on 1,673 CT volumes, and four benchmarking CXR datasets, totaling over 120K images, to demonstrate that (i) our proposed RSGAN achieves superior image quality compared to the state-of-the-art rib suppression methods; (ii) combining CXR with our rib-suppressed result leads to better performance in lung disease classification and tuberculosis area detection.

[133]  arXiv:2110.09148 (cross-list from eess.IV) [pdf, other]
Title: Body Part Regression for CT Images
Authors: Sarah Schuhegger
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

One of the greatest challenges in the medical imaging domain is to successfully transfer deep learning models into clinical practice. Since models are often trained on a specific body region, a robust transfer into the clinic necessitates the selection of images with body regions that fit the algorithm to avoid false-positive predictions in unknown regions. Due to the insufficient and inaccurate nature of manually-defined imaging meta-data, automated body part recognition is a key ingredient towards the broad and reliable adoption of medical deep learning models. While some approaches to this task have been presented in the past, building and evaluating robust algorithms for fine-grained body part recognition remains challenging. So far, no easy-to-use method exists to determine the scanned body range of medical Computed Tomography (CT) volumes. In this thesis, a self-supervised body part regression model for CT volumes is developed and trained on a heterogeneous collection of CT studies. Furthermore, it is demonstrated how the algorithm can contribute to the robust and reliable transfer of medical models into the clinic. Finally, easy application of the developed method is ensured by integrating it into the medical platform toolkit Kaapana and providing it as a python package at https://github.com/MIC-DKFZ/BodyPartRegression .

[134]  arXiv:2110.09192 (cross-list from cs.LG) [pdf, other]
Title: Learning Optimal Conformal Classifiers
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME); Machine Learning (stat.ML)

Modern deep learning based classifiers show very high accuracy on test data but this does not provide sufficient guarantees for safe deployment, especially in high-stake AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's probability estimates to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies to differentiate through CP during training with the goal of training model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. We show that CT outperforms state-of-the-art CP methods for classification by reducing the average confidence set size (inefficiency). Moreover, it allows to "shape" the confidence sets predicted at test time, which is difficult for standard CP. On experiments with several datasets, we show ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.

[135]  arXiv:2110.09273 (cross-list from cs.CY) [pdf, other]
Title: SafeAccess+: An Intelligent System to make Smart Home Safer and Americans with Disability Act Compliant
Authors: Shahinur Alam
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Smart homes are becoming ubiquitous, but they are not Americans with Disability Act (ADA) compliant. Smart homes equipped with ADA compliant appliances and services are critical for people with disabilities (i.e., visual impairments and limited mobility) to improve independence, safety, and quality of life. Despite all advancements in smart home technologies, some fundamental design and implementation issues remain. For example, people with disabilities often feel insecure to respond when someone knocks on the door or rings the doorbell. In this paper, we present an intelligent system called "SafeAccess+" to build safer and ADA compliant premises (e.g. smart homes, offices). The key functionalities of the SafeAccess+ are: 1) Monitoring the inside/outside of premises and identifying incoming people; 2) Providing users relevant information to assess incoming threats (e.g., burglary, robbery) and ongoing crimes 3) Allowing users to grant safe access to homes for friends/family members. We have addressed several technical and research challenges: - developing models to detect and recognize person/activity, generating image descriptions, designing ADA compliant end-end system. In addition, we have designed a prototype smart door showcasing the proof-of-concept. The premises are expected to be equipped with cameras placed in strategic locations that facilitate monitoring the premise 24/7 to identify incoming persons and to generate image descriptions. The system generates a pre-structured message from the image description to assess incoming threats and immediately notify the users. The completeness and generalization of models have been ensured through a rigorous quantitative evaluation. The users' satisfaction and reliability of the system has been measured using PYTHEIA scale and was rated excellent (Internal Consistency-Cronbach's alpha is 0.784, Test-retest reliability is 0.939 )

[136]  arXiv:2110.09288 (cross-list from eess.IV) [pdf, other]
Title: CT-SGAN: Computed Tomography Synthesis GAN
Comments: In Proceedings of MICCAI Deep Generative Models workshop, October 2021
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Diversity in data is critical for the successful training of deep learning models. Leveraged by a recurrent generative adversarial network, we propose the CT-SGAN model that generates large-scale 3D synthetic CT-scan volumes ($\geq 224\times224\times224$) when trained on a small dataset of chest CT-scans. CT-SGAN offers an attractive solution to two major challenges facing machine learning in medical imaging: a small number of given i.i.d. training data, and the restrictions around the sharing of patient data preventing to rapidly obtain larger and more diverse datasets. We evaluate the fidelity of the generated images qualitatively and quantitatively using various metrics including Fr\'echet Inception Distance and Inception Score. We further show that CT-SGAN can significantly improve lung nodule detection accuracy by pre-training a classifier on a vast amount of synthetic data.

[137]  arXiv:2110.09294 (cross-list from eess.IV) [pdf]
Title: Comparative Analysis of Deep Learning Algorithms for Classification of COVID-19 X-Ray Images
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

The Coronavirus was first emerged in December, in the city of China named Wuhan in 2019 and spread quickly all over the world. It has very harmful effects all over the global economy, education, social, daily living and general health of humans. To restrict the quick expansion of the disease initially, main difficulty is to explore the positive corona patients as quickly as possible. As there are no automatic tool kits accessible the requirement for supplementary diagnostic tools has risen up. Previous studies have findings acquired from radiological techniques proposed that this kind of images have important details related to the coronavirus. The usage of modified Artificial Intelligence (AI) system in combination with radio-graphical images can be fruitful for the precise and exact solution of this virus and can also be helpful to conquer the issue of deficiency of professional physicians in distant villages. In our research, we analyze the different techniques for the detection of COVID-19 using X-Ray radiographic images of the chest, we examined the different pre-trained CNN models AlexNet, VGG-16, MobileNet-V2, SqeezeNet, ResNet-34, ResNet-50 and COVIDX-Net to correct analytics for classification system of COVID-19. Our study shows that the pre trained CNN Model with ResNet-34 technique gives the higher accuracy rate of 98.33, 96.77% precision, and 98.36 F1-score, which is better than other CNN techniques. Our model may be helpful for the researchers to fine train the CNN model for the the quick screening of COVID patients.

[138]  arXiv:2110.09302 (cross-list from cs.LG) [pdf, other]
Title: A Prior Guided Adversarial Representation Learning and Hypergraph Perceptual Network for Predicting Abnormal Connections of Alzheimer's Disease
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Alzheimer's disease is characterized by alterations of the brain's structural and functional connectivity during its progressive degenerative processes. Existing auxiliary diagnostic methods have accomplished the classification task, but few of them can accurately evaluate the changing characteristics of brain connectivity. In this work, a prior guided adversarial representation learning and hypergraph perceptual network (PGARL-HPN) is proposed to predict abnormal brain connections using triple-modality medical images. Concretely, a prior distribution from the anatomical knowledge is estimated to guide multimodal representation learning using an adversarial strategy. Also, the pairwise collaborative discriminator structure is further utilized to narrow the difference of representation distribution. Moreover, the hypergraph perceptual network is developed to effectively fuse the learned representations while establishing high-order relations within and between multimodal images. Experimental results demonstrate that the proposed model outperforms other related methods in analyzing and predicting Alzheimer's disease progression. More importantly, the identified abnormal connections are partly consistent with the previous neuroscience discoveries. The proposed model can evaluate characteristics of abnormal brain connections at different stages of Alzheimer's disease, which is helpful for cognitive disease study and early treatment.

[139]  arXiv:2110.09305 (cross-list from eess.IV) [pdf, ps, other]
Title: Vit-GAN: Image-to-image Translation with Vision Transformes and Conditional GANS
Authors: Yiğit Gündüç
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

In this paper, we have developed a general-purpose architecture, Vit-Gan, capable of performing most of the image-to-image translation tasks from semantic image segmentation to single image depth perception. This paper is a follow-up paper, an extension of generator-based model [1] in which the obtained results were very promising. This opened the possibility of further improvements with adversarial architecture. We used a unique vision transformers-based generator architecture and Conditional GANs(cGANs) with a Markovian Discriminator (PatchGAN) (https://github.com/YigitGunduc/vit-gan). In the present work, we use images as conditioning arguments. It is observed that the obtained results are more realistic than the commonly used architectures.

[140]  arXiv:2110.09318 (cross-list from cs.MM) [pdf]
Title: Mixed Reality using Illumination-aware Gradient Mixing in Surgical Telepresence: Enhanced Multi-layer Visualization
Comments: 24 pages
Journal-ref: Multimedia Tools and Applications, 2021
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Background and aim: Surgical telepresence using augmented perception has been applied, but mixed reality is still being researched and is only theoretical. The aim of this work is to propose a solution to improve the visualization in the final merged video by producing globally consistent videos when the intensity of illumination in the input source and target video varies. Methodology: The proposed system uses an enhanced multi-layer visualization with illumination-aware gradient mixing using Illumination Aware Video Composition algorithm. Particle Swarm Optimization Algorithm is used to find the best sample pair from foreground and background region and image pixel correlation to estimate the alpha matte. Particle Swarm Optimization algorithm helps to get the original colour and depth of the unknown pixel in the unknown region. Result: Our results showed improved accuracy caused by reducing the Mean squared Error for selecting the best sample pair for unknown region in 10 each sample for bowel, jaw and breast. The amount of this reduction is 16.48% from the state of art system. As a result, the visibility accuracy is improved from 89.4 to 97.7% which helped to clear the hand vision even in the difference of light. Conclusion: Illumination effect and alpha pixel correlation improves the visualization accuracy and produces a globally consistent composition results and maintains the temporal coherency when compositing two videos with high and inverse illumination effect. In addition, this paper provides a solution for selecting the best sampling pair for the unknown region to obtain the original colour and depth.

[141]  arXiv:2110.09319 (cross-list from eess.IV) [pdf, other]
Title: Incremental Cross-Domain Adaptation for Robust Retinopathy Screening via Bayesian Deep Learning
Comments: Accepted in IEEE Transactions on Instrumentation and Measurement. Source code is available at this https URL
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Retinopathy represents a group of retinal diseases that, if not treated timely, can cause severe visual impairments or even blindness. Many researchers have developed autonomous systems to recognize retinopathy via fundus and optical coherence tomography (OCT) imagery. However, most of these frameworks employ conventional transfer learning and fine-tuning approaches, requiring a decent amount of well-annotated training data to produce accurate diagnostic performance. This paper presents a novel incremental cross-domain adaptation instrument that allows any deep classification model to progressively learn abnormal retinal pathologies in OCT and fundus imagery via few-shot training. Furthermore, unlike its competitors, the proposed instrument is driven via a Bayesian multi-objective function that not only enforces the candidate classification network to retain its prior learned knowledge during incremental training but also ensures that the network understands the structural and semantic relationships between previously learned pathologies and newly added disease categories to effectively recognize them at the inference stage. The proposed framework, evaluated on six public datasets acquired with three different scanners to screen thirteen retinal pathologies, outperforms the state-of-the-art competitors by achieving an overall accuracy and F1 score of 0.9826 and 0.9846, respectively.

[142]  arXiv:2110.09327 (cross-list from cs.LG) [pdf, other]
Title: Self-Supervised Representation Learning: Introduction, Advances and Challenges
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Self-supervised representation learning methods aim to provide powerful deep feature learning without the requirement of large annotated datasets, thus alleviating the annotation bottleneck that is one of the main barriers to practical deployment of deep learning today. These methods have advanced rapidly in recent years, with their efficacy approaching and sometimes surpassing fully supervised pre-training alternatives across a variety of data modalities including image, video, sound, text and graphs. This article introduces this vibrant area including key concepts, the four main families of approach and associated state of the art, and how self-supervised methods are applied to diverse modalities of data. We further discuss practical considerations including workflows, representation transferability, and compute cost. Finally, we survey the major open challenges in the field that provide fertile ground for future work.

[143]  arXiv:2110.09354 (cross-list from eess.IV) [pdf, other]
Title: An Analysis and Implementation of the HDR+ Burst Denoising Method
Comments: 28 pages, 15 figures, published at this https URL, code on this https URL
Journal-ref: Image Processing On Line, 11 (2021), pp. 142-169
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

HDR+ is an image processing pipeline presented by Google in 2016. At its core lies a denoising algorithm that uses a burst of raw images to produce a single higher quality image. Since it is designed as a versatile solution for smartphone cameras, it does not necessarily aim for the maximization of standard denoising metrics, but rather for the production of natural, visually pleasing images. In this article, we specifically discuss and analyze the HDR+ burst denoising algorithm architecture and the impact of its various parameters. With this publication, we provide an open source Python implementation of the algorithm, along with an interactive demo.

[144]  arXiv:2110.09383 (cross-list from cs.AI) [pdf, other]
Title: Neuro-Symbolic Forward Reasoning
Comments: Preprint
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Reasoning is an essential part of human intelligence and thus has been a long-standing goal in artificial intelligence research. With the recent success of deep learning, incorporating reasoning with deep learning systems, i.e., neuro-symbolic AI has become a major field of interest. We propose the Neuro-Symbolic Forward Reasoner (NSFR), a new approach for reasoning tasks taking advantage of differentiable forward-chaining using first-order logic. The key idea is to combine differentiable forward-chaining reasoning with object-centric (deep) learning. Differentiable forward-chaining reasoning computes logical entailments smoothly, i.e., it deduces new facts from given facts and rules in a differentiable manner. The object-centric learning approach factorizes raw inputs into representations in terms of objects. Thus, it allows us to provide a consistent framework to perform the forward-chaining inference from raw inputs. NSFR factorizes the raw inputs into the object-centric representations, converts them into probabilistic ground atoms, and finally performs differentiable forward-chaining inference using weighted rules for inference. Our comprehensive experimental evaluations on object-centric reasoning data sets, 2D Kandinsky patterns and 3D CLEVR-Hans, and a variety of tasks show the effectiveness and advantage of our approach.

[145]  arXiv:2110.09384 (cross-list from eess.IV) [pdf]
Title: Automatic Detection of COVID-19 and Pneumonia from Chest X-Ray using Deep Learning
Authors: Sarath Pathari
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

In this study, a dataset of X-ray images from patients with common viral pneumonia, bacterial pneumonia, confirmed Covid-19 disease was utilized for the automatic detection of the Coronavirus disease. The point of the investigation is to assess the exhibition of cutting edge convolutional neural system structures proposed over the ongoing years for clinical picture order. In particular, the system called Transfer Learning was received. With transfer learning, the location of different variations from the norm in little clinical picture datasets is a reachable objective, regularly yielding amazing outcomes. The datasets used in this trial. Firstly, a collection of 24000 X-ray images includes 6000 images for confirmed Covid-19 disease,6000 confirmed common bacterial pneumonia and 6000 images of normal conditions. The information was gathered and expanded from the accessible X-Ray pictures on open clinical stores. The outcomes recommend that Deep Learning with X-Ray imaging may separate noteworthy biological markers identified with the Covid-19 sickness, while the best precision, affectability, and particularity acquired is 97.83%, 96.81%, and 98.56% individually.

[146]  arXiv:2110.09431 (cross-list from cs.HC) [pdf, other]
Title: Comparing Deep Neural Nets with UMAP Tour
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Neural networks should be interpretable to humans. In particular, there is a growing interest in concepts learned in a layer and similarity between layers. In this work, a tool, UMAP Tour, is built to visually inspect and compare internal behavior of real-world neural network models using well-aligned, instance-level representations. The method used in the visualization also implies a new similarity measure between neural network layers. Using the visual tool and the similarity measure, we find concepts learned in state-of-the-art models and dissimilarities between them, such as GoogLeNet and ResNet.

[147]  arXiv:2110.09468 (cross-list from cs.LG) [pdf, other]
Title: Improving Robustness using Generated Data
Comments: Accepted at NeurIPS 2021
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Recent work argues that robust training requires substantially larger datasets than those required for standard classification. On CIFAR-10 and CIFAR-100, this translates into a sizable robust-accuracy gap between models trained solely on data from the original training set and those trained with additional data extracted from the "80 Million Tiny Images" dataset (TI-80M). In this paper, we explore how generative models trained solely on the original training set can be leveraged to artificially increase the size of the original training set and improve adversarial robustness to $\ell_p$ norm-bounded perturbations. We identify the sufficient conditions under which incorporating additional generated data can improve robustness, and demonstrate that it is possible to significantly reduce the robust-accuracy gap to models trained with additional real data. Surprisingly, we even show that even the addition of non-realistic random data (generated by Gaussian sampling) can improve robustness. We evaluate our approach on CIFAR-10, CIFAR-100, SVHN and TinyImageNet against $\ell_\infty$ and $\ell_2$ norm-bounded perturbations of size $\epsilon = 8/255$ and $\epsilon = 128/255$, respectively. We show large absolute improvements in robust accuracy compared to previous state-of-the-art methods. Against $\ell_\infty$ norm-bounded perturbations of size $\epsilon = 8/255$, our models achieve 66.10% and 33.49% robust accuracy on CIFAR-10 and CIFAR-100, respectively (improving upon the state-of-the-art by +8.96% and +3.29%). Against $\ell_2$ norm-bounded perturbations of size $\epsilon = 128/255$, our model achieves 78.31% on CIFAR-10 (+3.81%). These results beat most prior works that use external data.

[148]  arXiv:2110.09473 (cross-list from eess.IV) [pdf, other]
Title: DBSegment: Fast and robust segmentation of deep brain structures -- Evaluation of transportability across acquisition domains
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Segmenting deep brain structures from magnetic resonance images is important for patient diagnosis, surgical planning, and research. Most current state-of-the-art solutions follow a segmentation-by-registration approach, where subject MRIs are mapped to a template with well-defined segmentations. However, registration-based pipelines are time-consuming, thus, limiting their clinical use. This paper uses deep learning to provide a robust and efficient deep brain segmentation solution. The method consists of a pre-processing step to conform all MRI images to the same orientation, followed by a convolutional neural network using the nnU-Net framework. We use a total of 14 datasets from both research and clinical collections. Of these, seven were used for training and validation and seven were retained for independent testing. We trained the network to segment 30 deep brain structures, as well as a brain mask, using labels generated from a registration-based approach. We evaluated the generalizability of the network by performing a leave-one-dataset-out cross-validation, and extensive testing on external datasets. Furthermore, we assessed cross-domain transportability by evaluating the results separately on different domains. We achieved an average DSC of 0.89 $\pm$ 0.04 on the independent testing datasets when compared to the registration-based gold standard. On our test system, the computation time decreased from 42 minutes for a reference registration-based pipeline to 1 minute. Our proposed method is fast, robust, and generalizes with high reliability. It can be extended to the segmentation of other brain structures. The method is publicly available on GitHub, as well as a pip package for convenient usage.

[149]  arXiv:2110.09485 (cross-list from cs.LG) [pdf, other]
Title: Learning in High Dimension Always Amounts to Extrapolation
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

The notion of interpolation and extrapolation is fundamental in various fields from deep learning to function approximation. Interpolation occurs for a sample $x$ whenever this sample falls inside or on the boundary of the given dataset's convex hull. Extrapolation occurs when $x$ falls outside of that convex hull. One fundamental (mis)conception is that state-of-the-art algorithms work so well because of their ability to correctly interpolate training data. A second (mis)conception is that interpolation happens throughout tasks and datasets, in fact, many intuitions and theories rely on that assumption. We empirically and theoretically argue against those two points and demonstrate that on any high-dimensional ($>$100) dataset, interpolation almost surely never happens. Those results challenge the validity of our current interpolation/extrapolation definition as an indicator of generalization performances.

[150]  arXiv:2110.09506 (cross-list from cs.LG) [pdf, other]
Title: MEMO: Test Time Robustness via Adaptation and Augmentation
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

While deep neural networks can attain good accuracy on in-distribution test points, many applications require robustness even in the face of unexpected perturbations in the input, changes in the domain, or other sources of distribution shift. We study the problem of test time robustification, i.e., using the test input to improve model robustness. Recent prior works have proposed methods for test time adaptation, however, they each introduce additional assumptions, such as access to multiple test points, that prevent widespread adoption. In this work, we aim to study and devise methods that make no assumptions about the model training process and are broadly applicable at test time. We propose a simple approach that can be used in any test setting where the model is probabilistic and adaptable: when presented with a test example, perform different data augmentations on the data point, and then adapt (all of) the model parameters by minimizing the entropy of the model's average, or marginal, output distribution across the augmentations. Intuitively, this objective encourages the model to make the same prediction across different augmentations, thus enforcing the invariances encoded in these augmentations, while also maintaining confidence in its predictions. In our experiments, we demonstrate that this approach consistently improves robust ResNet and vision transformer models, achieving accuracy gains of 1-8% over standard model evaluation and also generally outperforming prior augmentation and adaptation strategies. We achieve state-of-the-art results for test shifts caused by image corruptions (ImageNet-C), renditions of common objects (ImageNet-R), and, among ResNet-50 models, adversarially chosen natural examples (ImageNet-A).

[151]  arXiv:2110.09514 (cross-list from cs.LG) [pdf, other]
Title: Discovering and Achieving Goals via World Models
Comments: NeurIPS 2021. First two authors contributed equally. Website at this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Machine Learning (stat.ML)

How can artificial agents learn to solve many diverse tasks in complex visual environments in the absence of any supervision? We decompose this question into two problems: discovering new goals and learning to reliably achieve them. We introduce Latent Explorer Achiever (LEXA), a unified solution to these that learns a world model from image inputs and uses it to train an explorer and an achiever policy from imagined rollouts. Unlike prior methods that explore by reaching previously visited states, the explorer plans to discover unseen surprising states through foresight, which are then used as diverse targets for the achiever to practice. After the unsupervised phase, LEXA solves tasks specified as goal images zero-shot without any additional learning. LEXA substantially outperforms previous approaches to unsupervised goal-reaching, both on prior benchmarks and on a new challenging benchmark with a total of 40 test tasks spanning across four standard robotic manipulation and locomotion domains. LEXA further achieves goals that require interacting with multiple objects in sequence. Finally, to demonstrate the scalability and generality of LEXA, we train a single general agent across four distinct environments. Code and videos at https://orybkin.github.io/lexa/

Replacements for Tue, 19 Oct 21

[152]  arXiv:1904.02422 (replaced) [pdf, other]
Title: Resource Efficient 3D Convolutional Neural Networks
Comments: Accepted to ICCV 2019 workshop - Neural Architects
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[153]  arXiv:1911.06644 (replaced) [pdf, other]
Title: You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[154]  arXiv:2003.12739 (replaced) [pdf, other]
Title: Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters
Comments: 13 pages, 6 figures, currently under review for IEEE Transactions on Multimedia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
[155]  arXiv:2004.05973 (replaced) [pdf, other]
Title: Speak2Label: Using Domain Knowledge for Creating a Large Scale Driver Gaze Zone Estimation Dataset
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[156]  arXiv:2007.03408 (replaced) [pdf, other]
Title: A Generative Model for Texture Synthesis based on Optimal Transport between Feature Distributions
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[157]  arXiv:2009.14639 (replaced) [pdf, other]
Title: Dissected 3D CNNs: Temporal Skip Connections for Efficient Online Video Processing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[158]  arXiv:2010.07217 (replaced) [pdf, other]
Title: Back to the Future: Cycle Encoding Prediction for Self-supervised Contrastive Video Representation Learning
Comments: accepted at BMVC
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[159]  arXiv:2011.02697 (replaced) [pdf, other]
Title: Center-wise Local Image Mixture For Contrastive Representation Learning
Comments: Accepted by BMVC2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[160]  arXiv:2011.09315 (replaced) [pdf, ps, other]
Title: End-to-End Object Detection with Adaptive Clustering Transformer
Comments: BMVC 2021 Oral
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[161]  arXiv:2011.13341 (replaced) [pdf, other]
Title: 4D Human Body Capture from Egocentric Video via 3D Scene Grounding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[162]  arXiv:2012.01230 (replaced) [pdf, other]
Title: Curiosity-driven 3D Object Detection Without Labels
Comments: 19 pages, 17 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[163]  arXiv:2012.01415 (replaced) [pdf, other]
Title: Prototype-based Incremental Few-Shot Semantic Segmentation
Comments: Accepted at BMVC 2021 (Poster)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[164]  arXiv:2012.01988 (replaced) [pdf, other]
Title: Wisdom of Committees: An Overlooked Approach To Faster and More Accurate Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[165]  arXiv:2012.05590 (replaced) [pdf, other]
Title: An Asynchronous Kalman Filter for Hybrid Event Cameras
Comments: 12 pages, 6 figures, published in International Conference on Computer Vision (ICCV) 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[166]  arXiv:2012.05657 (replaced) [pdf, other]
Title: Geometric Adversarial Attacks and Defenses on 3D Point Clouds
Comments: 3DV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[167]  arXiv:2012.09036 (replaced) [pdf, other]
Title: Improved StyleGAN Embedding: Where are the Good Latents?
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
[168]  arXiv:2012.11772 (replaced) [pdf, other]
Title: Power-SLIC: Fast Superpixel Segmentations by Diagrams
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Machine Learning (cs.LG)
[169]  arXiv:2101.04407 (replaced) [pdf, other]
Title: FaceX-Zoo: A PyTorch Toolbox for Face Recognition
Comments: add more backbone comparisons
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[170]  arXiv:2101.07338 (replaced) [pdf, other]
Title: Improving Makeup Face Verification by Exploring Part-Based Representations
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[171]  arXiv:2101.12677 (replaced) [pdf, other]
Title: Diminishing Domain Bias by Leveraging Domain Labels in Object Detection on UAVs
Comments: Accepted for publication at ICAR 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[172]  arXiv:2102.07318 (replaced) [pdf, other]
Title: A Global to Local Double Embedding Method for Multi-person Pose Estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[173]  arXiv:2102.09332 (replaced) [pdf, other]
Title: HVAQ: A High-Resolution Vision-Based Air Quality Dataset
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[174]  arXiv:2103.01946 (replaced) [pdf, other]
Title: Fixing Data Augmentation to Improve Adversarial Robustness
Comments: Since its original publication (2 Mar 2021), this paper has been accepted to NeurIPS 2021 as two separate and updated papers (Rebuffi et al., 2021; Gowal et al., 2021). The new papers improve results and clarity
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[175]  arXiv:2103.11099 (replaced) [pdf, other]
Title: Local Patch AutoAugment with Multi-Agent Collaboration
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[176]  arXiv:2103.12459 (replaced) [pdf, other]
Title: Dual Mesh Convolutional Networks for Human Shape Correspondence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[177]  arXiv:2103.13538 (replaced) [pdf, other]
Title: Hierarchical Proxy-based Loss for Deep Metric Learning
Comments: Accepted to WACV2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[178]  arXiv:2104.03133 (replaced) [pdf, other]
Title: Image Composition Assessment with Saliency-augmented Multi-pattern Pooling
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[179]  arXiv:2104.04724 (replaced) [pdf, other]
Title: Occlusion Guided Self-supervised Scene Flow Estimation on 3D Point Clouds
Comments: Accepted at 3DV 2021 (Poster)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[180]  arXiv:2104.04968 (replaced) [pdf, other]
Title: Knowledge-Augmented Contrastive Learning for Abnormality Classification and Localization in Chest X-rays with Radiomics using a Feedback Loop
Comments: Accepted by WACV 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[181]  arXiv:2104.10122 (replaced) [pdf, other]
Title: Improving state-of-the-art in Detecting Student Engagement with Resnet and TCN Hybrid Network
Comments: 7 pages, 3 figures, 1 table
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
[182]  arXiv:2104.10490 (replaced) [pdf, other]
Title: FIERY: Future Instance Prediction in Bird's-Eye View from Surround Monocular Cameras
Comments: ICCV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[183]  arXiv:2104.11746 (replaced) [pdf, other]
Title: VidTr: Video Transformer Without Convolutions
Comments: ICCV 2021 Accepted
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[184]  arXiv:2105.05301 (replaced) [pdf, other]
Title: Collaborative Regression of Expressive Bodies using Moderation
Comments: 21 pages. The first two authors contributed equally to this work
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[185]  arXiv:2105.11705 (replaced) [pdf, other]
Title: SBEVNet: End-to-End Deep Stereo Layout Estimation
Comments: WACV 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[186]  arXiv:2105.15089 (replaced) [pdf, other]
Title: Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[187]  arXiv:2106.01401 (replaced) [pdf, other]
Title: Container: Context Aggregation Network
Comments: NeuIPS 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[188]  arXiv:2106.01505 (replaced) [pdf, other]
Title: Barbershop: GAN-based Image Compositing using Segmentation Masks
Comments: Project page: this https URL Video: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
[189]  arXiv:2106.02514 (replaced) [pdf, other]
Title: The Image Local Autoregressive Transformer
Comments: Accepted by NeurIPS2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[190]  arXiv:2106.02898 (replaced) [pdf, other]
Title: Dynamic Resolution Network
Comments: Accepted by NeurIPS 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[191]  arXiv:2106.04427 (replaced) [pdf, other]
Title: On the relation between statistical learning and perceptual distances
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
[192]  arXiv:2106.12423 (replaced) [pdf, other]
Title: Alias-Free Generative Adversarial Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
[193]  arXiv:2106.14118 (replaced) [pdf, other]
Title: Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action Localization
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[194]  arXiv:2107.04281 (replaced) [pdf, other]
Title: JPGNet: Joint Predictive Filtering and Generative Network for Image Inpainting
Comments: This work has been accepted to ACM-MM 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[195]  arXiv:2107.06501 (replaced) [pdf, other]
Title: AdvFilter: Predictive Perturbation-aware Filtering against Adversarial Attack via Multi-domain Learning
Comments: This work has been accepted to ACM-MM 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[196]  arXiv:2107.07030 (replaced) [pdf, other]
Title: Diff-Net: Image Feature Difference based High-Definition Map Change Detection for Autonomous Driving
Comments: 13 pages, 4 figures. Fixed typos, added more explanations to figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[197]  arXiv:2107.12085 (replaced) [pdf, other]
Title: Learning to Adversarially Blur Visual Object Tracking
Comments: This work has been accepted to ICCV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[198]  arXiv:2108.04228 (replaced) [pdf, other]
Title: Iterative Distillation for Better Uncertainty Estimates in Multitask Emotion Recognition
Comments: Accepted as a Workshop paper in ICCV2021 proceeding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[199]  arXiv:2108.07917 (replaced) [pdf]
Title: Classification of Abnormal Hand Movement for Aiding in Autism Detection: Machine Learning Study
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[200]  arXiv:2108.10831 (replaced) [pdf, other]
Title: LLVIP: A Visible-infrared Paired Dataset for Low-light Vision
Comments: 9 pages, 9 figures, ICCV workshop
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[201]  arXiv:2109.01745 (replaced) [pdf, other]
Title: A realistic approach to generate masked faces applied on two novel masked face recognition data sets
Comments: Accepted at NeurIPS 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[202]  arXiv:2109.05211 (replaced) [pdf, other]
Title: RobustART: Benchmarking Robustness on Architecture Design and Training Techniques
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[203]  arXiv:2109.05675 (replaced) [pdf, other]
Title: Online Unsupervised Learning of Visual Representations and Categories
Comments: Technical report, 28 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[204]  arXiv:2109.09704 (replaced) [pdf, other]
Title: BabelCalib: A Universal Approach to Calibrating Central Cameras
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[205]  arXiv:2110.00970 (replaced) [pdf, other]
Title: Semantic-Guided Zero-Shot Learning for Low-Light Image/Video Enhancement
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[206]  arXiv:2110.01571 (replaced) [pdf, other]
Title: Learning Causal Representation for Face Transfer across Large Appearance Gap
Subjects: Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
[207]  arXiv:2110.02651 (replaced) [pdf, other]
Title: Weak Novel Categories without Tears: A Survey on Weak-Shot Learning
Authors: Li Niu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[208]  arXiv:2110.03016 (replaced) [pdf, other]
Title: DeepBBS: Deep Best Buddies for Point Cloud Registration
Comments: Accepted to 3DV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[209]  arXiv:2110.04076 (replaced) [pdf, other]
Title: Self-supervised Point Cloud Prediction Using 3D Spatio-temporal Convolutional Networks
Comments: Accepted for CoRL 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[210]  arXiv:2110.04764 (replaced) [pdf, other]
Title: Deep learning-based person re-identification methods: A survey and outlook of recent works
Comments: 21 pages, 13 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[211]  arXiv:2110.05319 (replaced) [pdf, other]
Title: Efficient Training of 3D Seismic Image Fault Segmentation Network under Sparse Labels by Weakening Anomaly Annotation
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Geophysics (physics.geo-ph)
[212]  arXiv:2110.05668 (replaced) [pdf, other]
Title: NAS-Bench-360: Benchmarking Diverse Tasks for Neural Architecture Search
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[213]  arXiv:2110.05717 (replaced) [pdf, other]
Title: Relation-aware Video Reading Comprehension for Temporal Language Grounding
Comments: Accepted by EMNLP-21
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
[214]  arXiv:2110.06382 (replaced) [pdf]
Title: A Survey of Open Source User Activity Traces with Applications to User Mobility Characterization and Modeling
Comments: 23 pages, 6 pages references
Subjects: Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI)
[215]  arXiv:2110.06607 (replaced) [pdf, other]
Title: THOMAS: Trajectory Heatmap Output with learned Multi-Agent Sampling
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[216]  arXiv:2110.06635 (replaced) [pdf, other]
Title: ADOP: Approximate Differentiable One-Pixel Point Rendering
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
[217]  arXiv:2110.06863 (replaced) [pdf, other]
Title: Improving Users' Mental Model with Attention-directed Counterfactual Edits
Comments: Accepted for publication in Applied AI Letters
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
[218]  arXiv:2110.07511 (replaced) [pdf, other]
Title: Contrastive Proposal Extension with LSTM Network for Weakly Supervised Object Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[219]  arXiv:2110.07604 (replaced) [pdf, other]
Title: NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild
Comments: In NeurIPS 2021. v2-3: Fixed minor typos
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[220]  arXiv:2110.07723 (replaced) [pdf, other]
Title: EMDS-7: Environmental Microorganism Image Dataset Seventh Version for Multiple Object Detection Evaluation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[221]  arXiv:2110.07809 (replaced) [pdf, other]
Title: PTQ-SL: Exploring the Sub-layerwise Post-training Quantization
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[222]  arXiv:2110.08059 (replaced) [pdf, other]
Title: FlexConv: Continuous Kernel Convolutions with Differentiable Kernel Sizes
Comments: First two authors contributed equally to this work
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[223]  arXiv:2001.04286 (replaced) [pdf, other]
Title: Nonparametric Continuous Sensor Registration
Comments: 50 pages. Accepted for Journal of Machine Learning Research. arXiv admin note: text overlap with arXiv:1904.02266
Subjects: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
[224]  arXiv:2003.06566 (replaced) [pdf, other]
Title: On the benefits of defining vicinal distributions in latent space
Comments: Accepted at Elsevier Pattern Recognition Letters (2021), Best Paper Award at CVPR 2021 Workshop on Adversarial Machine Learning in Real-World Computer Vision (AML-CV), Also accepted at ICLR 2021 Workshops on Robust-Reliable Machine Learning (Oral) and Generalization beyond the training distribution (Abstract)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[225]  arXiv:2005.03566 (replaced) [pdf, other]
Title: Noisy Differentiable Architecture Search
Comments: BMVC 2021
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[226]  arXiv:2011.07010 (replaced) [pdf, other]
Title: Monitoring and Diagnosability of Perception Systems
Comments: Updated version of arXiv:2005.11816
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[227]  arXiv:2012.01788 (replaced) [pdf, other]
Title: Object SLAM-Based Active Mapping and Robotic Grasping
Comments: Accepted for IEEE International Conference on 3D Vision (3DV), 2021. Project page: this https URL
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
[228]  arXiv:2012.03581 (replaced) [pdf, other]
Title: DIPPAS: A Deep Image Prior PRNU Anonymization Scheme
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
[229]  arXiv:2103.14616 (replaced) [pdf, other]
Title: Training a Task-Specific Image Reconstruction Loss
Comments: Accepted at WACV 2022
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[230]  arXiv:2104.08773 (replaced) [pdf, other]
Title: Cross-Task Generalization via Natural Language Crowdsourcing Instructions
Comments: 20 pages
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[231]  arXiv:2106.10404 (replaced) [pdf, other]
Title: Sparse Training via Boosting Pruning Plasticity with Neuroregeneration
Comments: Published on the thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021). Code can be found this https URL
Journal-ref: Conference on Neural Information Processing Systems (NeurIPS 2021)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[232]  arXiv:2108.03873 (replaced) [src]
Title: Rain Removal and Illumination Enhancement Done in One Go
Comments: In section 5.2 of the paper, the comparison results are unfair due to different calculation methods of model speed. Please allow us to correct the unfair result
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[233]  arXiv:2109.02934 (replaced) [pdf, other]
Title: Fishr: Invariant Gradient Variances for Out-of-distribution Generalization
Comments: 32 pages, 13 tables, 6 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[234]  arXiv:2109.10000 (replaced) [pdf, other]
Title: Automated segmentation and extraction of posterior eye segment using OCT scans
Comments: Accepted in 2021 IEEE International Conference on Robotics and Automation in Industry (ICRAI)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[235]  arXiv:2110.00804 (replaced) [pdf, other]
Title: ProTo: Program-Guided Transformer for Program-Guided Tasks
Comments: Accepted in NeurIPS 2021
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[236]  arXiv:2110.01303 (replaced) [pdf, other]
Title: Incremental Class Learning using Variational Autoencoders with Similarity Learning
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[237]  arXiv:2110.03267 (replaced) [pdf, other]
Title: Propagating State Uncertainty Through Trajectory Forecasting
Comments: 8 pages, 6 figures, 4 tables. Fixed section name typo
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[238]  arXiv:2110.03909 (replaced) [pdf, other]
Title: Meta-Learning with Task-Adaptive Loss Function for Few-Shot Learning
Comments: ICCV 2021 (Oral). Code at this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[239]  arXiv:2110.04203 (replaced) [pdf, other]
Title: Toward a Human-Level Video Understanding Intelligence
Comments: Presented at AI-HRI symposium as part of AAAI-FSS 2021 (arXiv:2109.10836). The first two authors have equal contribution
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
[240]  arXiv:2110.06976 (replaced) [pdf, other]
Title: Rethinking the Representational Continuity: Towards Unsupervised Continual Learning
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[241]  arXiv:2110.08009 (replaced) [pdf, other]
Title: MaGNET: Uniform Sampling from Deep Generative Network Manifolds Without Retraining
Comments: 13 pages, 14 pages Appendix, 23 figures
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[ total of 241 entries: 1-241 ]
[ showing up to 2000 entries per page: fewer | more ]

Disable MathJax (What is MathJax?)

Links to: arXiv, form interface, find, cs, recent, 2110, contact, help  (Access key information)