Computer Vision and Pattern Recognition

New submissions

New submissions for Wed, 1 Dec 21

[1]  arXiv:2111.14831 [pdf]
Title: MIST-net: Multi-domain Integrative Swin Transformer network for Sparse-View CT Reconstruction
Comments: 21 pages, 8 figures, 57 references
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Deep learning-based tomographic image reconstruction has attracted much attention in recent years. Sparse-view data reconstruction is a typical underdetermined inverse problem, and reconstructing high-quality CT images from dozens of projections remains a challenge in practice. To address this challenge, in this article we propose a Multi-domain Integrative Swin Transformer network (MIST-net). First, the proposed MIST-net incorporates rich domain features from the data, residual-data, image, and residual-image domains using flexible network architectures. Here, the residual-data and residual-image domain network components can be considered a data consistency module that eliminates interpolation errors in both the residual data and image domains and thereby further retains image details. Second, to detect image features and further protect image edges, a trainable Sobel filter is incorporated into the network to improve its encode-decode ability. Third, building on the classical Swin transformer, we further design a high-quality reconstruction transformer (i.e., Recformer) to improve reconstruction performance. The Recformer inherits the power of the Swin transformer to capture the global and local features of the reconstructed image. Experiments on numerical datasets with 48 views demonstrate that the proposed MIST-net provides higher reconstructed image quality, with better small-feature recovery and edge protection, than other competitors, including advanced unrolled networks. The quantitative results show that MIST-net also obtains the best performance. The trained network was transferred to a real cardiac CT dataset with 48 views; the reconstruction results further validate the advantages of MIST-net and demonstrate its good robustness in clinical applications.
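
As a rough illustration of the edge-detection step mentioned in this abstract, the sketch below applies the standard fixed 3x3 Sobel kernels to a toy image in plain Python. It is not the MIST-net implementation: the abstract describes a trainable Sobel filter inside a network, whereas here the kernel values are constants, and the helper names (`correlate3x3`, `sobel_magnitude`) and the toy image are assumptions made for illustration.

```python
# Standard 3x3 Sobel kernels. In a trainable variant these values would be
# initial weights updated during training; that step is not shown here.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]


def correlate3x3(image, kernel):
    """Valid-mode cross-correlation of a 2D list with a 3x3 kernel."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        row = []
        for c in range(cols - 2):
            acc = 0
            for i in range(3):
                for j in range(3):
                    acc += kernel[i][j] * image[r + i][c + j]
            row.append(acc)
        out.append(row)
    return out


def sobel_magnitude(image):
    """Gradient magnitude from the horizontal and vertical Sobel responses."""
    gx = correlate3x3(image, SOBEL_X)
    gy = correlate3x3(image, SOBEL_Y)
    return [[(x * x + y * y) ** 0.5 for x, y in zip(rx, ry)]
            for rx, ry in zip(gx, gy)]


# Toy 4x4 image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
edges = sobel_magnitude(image)
```

On this toy input every output position straddles the step edge, so the response is uniform; a flat image would give a zero response everywhere.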

[2]  arXiv:2111.14887 [pdf, other]
Title: DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

As acquiring pixel-wise annotations of real-world images for semantic segmentation is a costly process, a model can instead be trained with more accessible synthetic data and adapted to real images without requiring their annotations. This process is studied in unsupervised domain adaptation (UDA). Even though a large number of methods propose new adaptation strategies, they are mostly based on outdated network architectures. As the influence of recent network architectures has not been systematically studied, we first benchmark different network architectures for UDA and then propose a novel UDA method, DAFormer, based on the benchmark results. The DAFormer network consists of a Transformer encoder and a multi-level context-aware feature fusion decoder. It is enabled by three simple but crucial training strategies to stabilize the training and to avoid overfitting DAFormer to the source domain: While the Rare Class Sampling on the source domain improves the quality of pseudo-labels by mitigating the confirmation bias of self-training towards common classes, the Thing-Class ImageNet Feature Distance and a learning rate warmup promote feature transfer from ImageNet pretraining. DAFormer significantly improves the state-of-the-art performance by 10.8 mIoU for GTA->Cityscapes and 5.4 mIoU for Synthia->Cityscapes and enables learning even difficult classes such as train, bus, and truck well. The implementation is available at https://github.com/lhoyer/DAFormer.

[3]  arXiv:2111.14893 [pdf, other]
Title: Learning Multiple Dense Prediction Tasks from Partially Annotated Data
Comments: Multi-task Partially-supervised Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Despite the recent advances in multi-task learning of dense prediction problems, most methods rely on expensive labelled datasets. In this paper, we present a label-efficient approach and look at jointly learning multiple dense prediction tasks on partially annotated data, which we call multi-task partially-supervised learning. We propose a multi-task training procedure that successfully leverages task relations to supervise its multi-task learning when data is partially annotated. In particular, we learn to map each task pair to a joint pairwise task-space, which enables sharing information between them in a computationally efficient way through another network conditioned on task pairs, and avoids learning trivial cross-task relations by retaining high-level information about the input image. We rigorously demonstrate that our proposed method effectively exploits the images with unlabelled tasks and outperforms existing semi-supervised learning approaches and related methods on three standard benchmarks.

[4]  arXiv:2111.14923 [pdf, other]
Title: Equitable modelling of brain imaging by counterfactual augmentation with morphologically constrained 3D deep generative models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

We describe CounterSynth, a conditional generative model of diffeomorphic deformations that induce label-driven, biologically plausible changes in volumetric brain images. The model is intended to synthesise counterfactual training data augmentations for downstream discriminative modelling tasks where fidelity is limited by data imbalance, distributional instability, confounding, or underspecification, and exhibits inequitable performance across distinct subpopulations. Focusing on demographic attributes, we evaluate the quality of synthesized counterfactuals with voxel-based morphometry, classification and regression of the conditioning attributes, and the Fréchet inception distance. Examining downstream discriminative performance in the context of engineered demographic imbalance and confounding, we use UK Biobank magnetic resonance imaging data to benchmark CounterSynth augmentation against current solutions to these problems. We achieve state-of-the-art improvements, both in overall fidelity and equity. The source code for CounterSynth is available online.

[5]  arXiv:2111.14931 [pdf, other]
Title: How Facial Features Convey Attention in Stationary Environments
Authors: Janelle Domantay
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Awareness detection technologies have been gaining traction in a variety of enterprises; most often used for driver fatigue detection, recent research has shifted towards using computer vision technologies to analyze user attention in environments such as online classrooms. This paper aims to extend previous research on distraction detection by analyzing which visual features contribute most to predicting awareness and fatigue. We utilized the open source facial analysis toolkit OpenFace in order to analyze visual data of subjects at varying levels of attentiveness. Then, using a Support-Vector Machine (SVM), we created several prediction models for user attention and identified Histogram of Oriented Gradients (HOG) and Action Units to be the greatest predictors of the features we tested. We also compared the performance of this SVM to deep learning approaches that utilize Convolutional and/or Recurrent neural networks (CNNs and CRNNs). Interestingly, CRNNs did not appear to perform significantly better than their CNN counterparts. While deep learning methods achieved greater prediction accuracy, SVMs utilized fewer resources and, using certain parameters, were able to approach the performance of deep learning methods.

[6]  arXiv:2111.14943 [pdf, other]
Title: Morph Detection Enhanced by Structured Group Sparsity
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we consider the challenge of face morphing attacks, which substantially undermine the integrity of face recognition systems such as those adopted for use in border protection agencies. Morph detection can be formulated as extracting fine-grained representations, where local discriminative features are harnessed for learning a hypothesis. To acquire discriminative features at different granularity as well as decoupled spectral information, we leverage wavelet domain analysis to gain insight into the spatial-frequency content of a morphed face. As such, instead of using images in the RGB domain, we decompose every image into its wavelet sub-bands using 2D wavelet decomposition, and a deep supervised feature selection scheme is employed to find the most discriminative wavelet sub-bands of input images. To this end, we train a Deep Neural Network (DNN) morph detector using the decomposed wavelet sub-bands of the morphed and bona fide images. In the training phase, our structured group sparsity-constrained DNN picks the most discriminative wavelet sub-bands out of all the sub-bands, with which we retrain our DNN, resulting in precise detection of morphed images when inference is performed on a probe image. The efficacy of our deep morph detector, which is enhanced by structured group lasso, is validated through experiments on three facial morph image databases, i.e., VISAPP17, LMA, and MorGAN.

[7]  arXiv:2111.14948 [pdf]
Title: Image denoising by Super Neurons: Why go deep?
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Classical image denoising methods utilize the non-local self-similarity principle to effectively recover image content from noisy images. Current state-of-the-art methods use deep convolutional neural networks (CNNs) to effectively learn the mapping from noisy to clean images. Deep denoising CNNs manifest a high learning capacity and integrate non-local information owing to the large receptive field yielded by a cascade of numerous hidden layers. However, deep networks are also computationally complex and require large amounts of data for training. To address these issues, this study focuses on Self-organized Operational Neural Networks (Self-ONNs) empowered by a novel neuron model that can achieve a similar or better denoising performance with a compact and shallow model. Recently, the concept of super neurons has been introduced, which augment the non-linear transformations of generative neurons by utilizing non-localized kernel locations for an enhanced receptive field size. This is the key accomplishment that removes the need for a deep network configuration. As the integration of non-local information is known to benefit denoising, in this work we investigate the use of super neurons for both synthetic and real-world image denoising. We also discuss the practical issues in implementing the super neuron model on GPUs and propose a trade-off between the heterogeneity of non-localized operations and computational complexity. Our results demonstrate that, with the same width and depth, Self-ONNs with super neurons provide a significant boost in denoising performance over networks with generative and convolutional neurons for both denoising tasks. Moreover, the results demonstrate that Self-ONNs with super neurons can achieve competitive and superior denoising performance compared to well-known deep CNN denoisers for synthetic and real-world denoising, respectively.

[8]  arXiv:2111.14973 [pdf, other]
Title: MultiPath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Predicting the future behavior of road users is one of the most challenging and important problems in autonomous driving. Applying deep learning to this problem requires fusing heterogeneous world state in the form of rich perception signals and map information, and inferring highly multi-modal distributions over possible futures. In this paper, we present MultiPath++, a future prediction model that achieves state-of-the-art performance on popular benchmarks. MultiPath++ improves the MultiPath architecture by revisiting many design choices. The first key design difference is a departure from dense image-based encoding of the input world state in favor of a sparse encoding of heterogeneous scene elements: MultiPath++ consumes compact and efficient polylines to describe road features, and raw agent state information directly (e.g., position, velocity, acceleration). We propose a context-aware fusion of these elements and develop a reusable multi-context gating fusion component. Second, we reconsider the choice of pre-defined, static anchors, and develop a way to learn latent anchor embeddings end-to-end in the model. Lastly, we explore ensembling and output aggregation techniques -- common in other ML domains -- and find effective variants for our probabilistic multimodal output representation. We perform an extensive ablation on these design choices, and show that our proposed model achieves state-of-the-art performance on the Argoverse Motion Forecasting Competition and the Waymo Open Dataset Motion Prediction Challenge.

[9]  arXiv:2111.15000 [pdf, other]
Title: Deformable ProtoPNet: An Interpretable Image Classifier Using Deformable Prototypes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Machine learning has been widely adopted in many domains, including high-stakes applications such as healthcare, finance, and criminal justice. To address concerns of fairness, accountability and transparency, predictions made by machine learning models in these critical domains must be interpretable. One line of work approaches this challenge by integrating the power of deep neural networks and the interpretability of case-based reasoning to produce accurate yet interpretable image classification models. These models generally classify input images by comparing them with prototypes learned during training, yielding explanations in the form of "this looks like that." However, methods from this line of work use spatially rigid prototypes, which cannot explicitly account for pose variations. In this paper, we address this shortcoming by proposing a case-based interpretable neural network that provides spatially flexible prototypes, called a deformable prototypical part network (Deformable ProtoPNet). In a Deformable ProtoPNet, each prototype is made up of several prototypical parts that adaptively change their relative spatial positions depending on the input image. This enables each prototype to detect object features with a higher tolerance to spatial transformations, as the parts within a prototype are allowed to move. Consequently, a Deformable ProtoPNet can explicitly capture pose variations, improving both model accuracy and the richness of explanations provided. Compared to other case-based interpretable models using prototypes, our approach achieves competitive accuracy, gives an explanation with greater context, and is easier to train, thus enabling wider use of interpretable models for computer vision.

[10]  arXiv:2111.15015 [pdf, other]
Title: Neural Attention for Image Captioning: Review of Outstanding Methods
Comments: This is the accepted version, which we are allowed to publish on arxiv based on Springer Nature policies. For the published version please refer to Springer Nature Artificial Intelligence Review Journal. DOI number is attached. For Citation refer to AIRE journal using DOI link
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Image captioning is the task of automatically generating sentences that describe an input image in the best way possible. The most successful techniques for automatically generating image captions have recently used attentive deep learning models. There are variations in the way deep learning models with attention are designed. In this survey, we provide a review of literature related to attentive deep learning models for image captioning. Instead of offering a comprehensive review of all prior work on deep image captioning models, we explain various types of attention mechanisms used for the task of image captioning in deep learning models. The most successful deep learning models used for image captioning follow the encoder-decoder architecture, although there are differences in the way these models employ attention mechanisms. Via analysis on performance results from different attentive deep models for image captioning, we aim at finding the most successful types of attention mechanisms in deep models for image captioning. Soft attention, bottom-up attention, and multi-head attention are the types of attention mechanism widely used in state-of-the-art attentive deep learning models for image captioning. At the current time, the best results are achieved from variants of multi-head attention with bottom-up attention.

[11]  arXiv:2111.15018 [pdf, other]
Title: Hyperspectral Image Segmentation based on Graph Processing over Multilayer Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Hyperspectral imaging is an important sensing technology with broad applications and impact in areas including environmental science, weather, and geo/space exploration. One important task of hyperspectral image (HSI) processing is the extraction of spectral-spatial features. Leveraging the recently developed graph signal processing over multilayer networks (M-GSP), this work proposes several approaches to HSI segmentation based on M-GSP feature extraction. To capture joint spectral-spatial information, we first customize a tensor-based multilayer network (MLN) model for HSI, and define an MLN singular space for feature extraction. We then develop an unsupervised HSI segmentation method by utilizing MLN spectral clustering. Regrouping HSI pixels via MLN-based clustering, we further propose a semi-supervised HSI classification based on multi-resolution fusions of superpixels. Our experimental results demonstrate the strength of M-GSP in HSI processing and spectral-spatial information extraction.

[12]  arXiv:2111.15047 [pdf, other]
Title: Adaptive Gating for Single-Photon 3D Imaging
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Single-photon avalanche diodes (SPADs) are growing in popularity for depth sensing tasks. However, SPADs still struggle in the presence of high ambient light due to the effects of pile-up. Conventional techniques leverage fixed or asynchronous gating to minimize pile-up effects, but these gating schemes are all non-adaptive, as they are unable to incorporate factors such as scene priors and previous photon detections into their gating strategy. We propose an adaptive gating scheme built upon Thompson sampling. Adaptive gating periodically updates the gate position based on prior photon observations in order to minimize depth errors. Our experiments show that our gating strategy results in significantly reduced depth reconstruction error and acquisition time, even when operating outdoors under strong sunlight conditions.
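
To make the idea of choosing a gate position from past observations concrete, here is a minimal sketch of generic Beta-Bernoulli Thompson sampling over a handful of candidate gate positions. It is not the scheme from this paper: the reward model (a per-gate probability of a "useful" detection), the function name `thompson_gate_selection`, and all numbers are hypothetical and only drive a toy simulation.

```python
import random


def thompson_gate_selection(success_prob, rounds, seed=0):
    """Choose among candidate gate positions with Thompson sampling.

    success_prob[i] is a hypothetical probability that gating at position i
    yields a useful photon detection; it is used only to simulate feedback.
    """
    rng = random.Random(seed)
    n = len(success_prob)
    alpha = [1] * n  # Beta parameter: 1 + observed successes per gate
    beta = [1] * n   # Beta parameter: 1 + observed failures per gate
    picks = [0] * n
    for _ in range(rounds):
        # Draw one sample per gate from its posterior; use the largest draw.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n)]
        gate = max(range(n), key=lambda i: samples[i])
        picks[gate] += 1
        # Simulated observation, then posterior update for the chosen gate.
        if rng.random() < success_prob[gate]:
            alpha[gate] += 1
        else:
            beta[gate] += 1
    return picks, alpha, beta


picks, alpha, beta = thompson_gate_selection([0.1, 0.2, 0.8, 0.3], rounds=500)
```

Sampling from the posterior rather than always taking the current best estimate keeps some exploration of rarely tried gates while concentrating most acquisitions on the gates that have worked so far.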

[13]  arXiv:2111.15050 [pdf, other]
Title: AssistSR: Affordance-centric Question-driven Video Segment Retrieval
Comments: 15 pages, 11 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

It is still a pipe dream that AI assistants on phones and AR glasses can assist our daily life by addressing questions like "how to adjust the date for this watch?" and "how to set its heating duration? (while pointing at an oven)". The queries used in conventional tasks (i.e. Video Question Answering, Video Retrieval, Moment Localization) are often factoid and based on pure text. In contrast, we present a new task called Affordance-centric Question-driven Video Segment Retrieval (AQVSR). Each of our questions is an image-box-text query that focuses on the affordance of items in our daily life and expects relevant answer segments to be retrieved from a corpus of instructional video-transcript segments. To support the study of this AQVSR task, we construct a new dataset called AssistSR. We design novel guidelines to create high-quality samples. This dataset contains 1.4k multimodal questions on 1k video segments from instructional videos on diverse daily-used items. To address AQVSR, we develop a straightforward yet effective model called Dual Multimodal Encoders (DME) that significantly outperforms several baseline methods while still having large room for improvement in the future. Moreover, we present detailed ablation analyses. Our code and data are available at https://github.com/StanLei52/AQVSR.

[14]  arXiv:2111.15056 [pdf, other]
Title: Camera Distortion-aware 3D Human Pose Estimation in Video with Optimization-based Meta-Learning
Comments: Accepted to ICCV 2021 (poster)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Existing 3D human pose estimation algorithms trained on distortion-free datasets suffer a performance drop when applied to new scenarios with a specific camera distortion. In this paper, we propose a simple yet effective model for 3D human pose estimation in video that can quickly adapt to any distortion environment by utilizing MAML, a representative optimization-based meta-learning algorithm. We consider a sequence of 2D keypoints in a particular distortion as a single task of MAML. However, due to the absence of a large-scale dataset in a distorted environment, we propose an efficient method to generate synthetic distorted data from undistorted 2D keypoints. For the evaluation, we assume two practical testing situations depending on whether a motion capture sensor is available or not. In particular, we propose Inference Stage Optimization using bone-length symmetry and consistency. Extensive evaluation shows that our proposed method successfully adapts to various degrees of distortion in the testing phase and outperforms the existing state-of-the-art approaches. The proposed method is useful in practice because it does not require camera calibration or additional computations in a testing set-up.

[15]  arXiv:2111.15064 [pdf]
Title: Hole-robust Wireframe Detection
Comments: To appear in Proceedings of the 2022 IEEE Winter Conference on Applications of Computer Vision (WACV 2022)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

"Wireframe" is a line segment based representation designed to well capture large-scale visual properties of regular, structural shaped man-made scenes surrounding us. Unlike the wireframes, conventional edges or line segments focus on all visible edges and lines without particularly distinguishing which of them are more salient to man-made structural information. Existing wireframe detection models rely on supervising the annotated data but do not explicitly pay attention to understand how to compose the structural shapes of the scene. In addition, we often face that many foreground objects occluding the background scene interfere with proper inference of the full scene structure behind them. To resolve these problems, we first time in the field, propose new conditional data generation and training that help the model understand how to ignore occlusion indicated by holes, such as foreground object regions masked out on the image. In addition, we first time combine GAN in the model to let the model better predict underlying scene structure even beyond large holes. We also introduce pseudo labeling to further enlarge the model capacity to overcome small-scale labeled data. We show qualitatively and quantitatively that our approach significantly outperforms previous works unable to handle holes, as well as improves ordinary detection without holes given.

[16]  arXiv:2111.15077 [pdf, other]
Title: Unsupervised Domain Generalization for Person Re-identification: A Domain-specific Adaptive Framework
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Domain generalization (DG) has attracted much attention in person re-identification (ReID) recently. It aims to make a model trained on multiple source domains generalize to an unseen target domain. Although promising progress has been made, existing methods usually need the source domains to be labeled, which could be a significant burden for practical ReID tasks. In this paper, we investigate unsupervised domain generalization for ReID by assuming that no label is available for any source domain. To address this challenging setting, we propose a simple and efficient domain-specific adaptive framework, and realize it with an adaptive normalization module designed upon the batch and instance normalization techniques. In doing so, we successfully yield reliable pseudo-labels to implement training and also enhance the domain generalization capability of the model as required. In addition, we show that our framework can even be applied to improve person ReID under the settings of supervised domain generalization and unsupervised domain adaptation, demonstrating competitive performance with respect to relevant methods. An extensive experimental study on benchmark datasets is conducted to validate the proposed framework. The significance of our work is that it shows the potential of unsupervised domain generalization for person ReID and sets a strong baseline for further research on this topic.

[17]  arXiv:2111.15078 [pdf, other]
Title: SketchEdit: Mask-Free Local Image Manipulation with Partial Sketches
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Sketch-based image manipulation is an interactive image editing task to modify an image based on input sketches from users. Existing methods typically formulate this task as a conditional inpainting problem, which requires users to draw an extra mask indicating the region to modify in addition to sketches. The masked regions are regarded as holes and filled by an inpainting model conditioned on the sketch. With this formulation, paired training data can be easily obtained by randomly creating masks and extracting edges or contours. Although this setup simplifies data preparation and model design, it complicates user interaction and discards useful information in masked regions. To this end, we investigate a new paradigm of sketch-based image manipulation: mask-free local image manipulation, which only requires sketch inputs from users and utilizes the entire original image. Given an image and sketch, our model automatically predicts the target modification region and encodes it into a structure agnostic style vector. A generator then synthesizes the new image content based on the style vector and sketch. The manipulated image is finally produced by blending the generator output into the modification region of the original image. Our model can be trained in a self-supervised fashion by learning the reconstruction of an image region from the style vector and sketch. The proposed method offers simpler and more intuitive user workflows for sketch-based image manipulation and provides better results than previous approaches. More results, code and interactive demo will be available at https://zengxianyu.github.io/sketchedit.

[18]  arXiv:2111.15097 [pdf, other]
Title: EAGAN: Efficient Two-stage Evolutionary Architecture Search for GANs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Generative Adversarial Networks (GANs) have proven hugely successful in image generation tasks, but GAN training suffers from instability. Many works have improved the stability of GAN training by manually modifying the GAN architecture, which requires human expertise and extensive trial-and-error. Thus, neural architecture search (NAS), which aims to automate the model design, has been applied to search GANs on the task of unconditional image generation. The early NAS-GAN works only search generators to reduce the difficulty. Some recent works have attempted to search both the generator (G) and the discriminator (D) to improve GAN performance, but they still suffer from the instability of GAN training during the search. To alleviate the instability issue, we propose an efficient two-stage evolutionary algorithm (EA) based NAS framework to discover GANs, dubbed EAGAN. Specifically, we decouple the search of G and D into two stages and propose the weight-resetting strategy to improve the stability of GAN training. Besides, we perform evolution operations to produce the Pareto-front architectures based on multiple objectives, resulting in a superior combination of G and D. By leveraging the weight-sharing strategy and low-fidelity evaluation, EAGAN can significantly shorten the search time. EAGAN achieves highly competitive results on CIFAR-10 (IS=8.81±0.10, FID=9.91) and surpasses previous NAS-searched GANs on the STL-10 dataset (IS=10.44±0.087, FID=22.18).
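
The Pareto-front selection mentioned in this abstract can be sketched in a few lines. The example below assumes two objectives, Inception Score (higher is better) and FID (lower is better), since those are the metrics the abstract reports; the actual objectives used in the search are not stated here, and the candidate names and scores are made up for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b on both objectives
    (higher IS, lower FID) and strictly better on at least one."""
    no_worse = a["is"] >= b["is"] and a["fid"] <= b["fid"]
    better = a["is"] > b["is"] or a["fid"] < b["fid"]
    return no_worse and better


def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical architectures with made-up scores.
candidates = [
    {"name": "A", "is": 8.8, "fid": 10.0},
    {"name": "B", "is": 8.5, "fid": 9.5},
    {"name": "C", "is": 8.4, "fid": 11.0},  # dominated by both A and B
    {"name": "D", "is": 9.0, "fid": 12.0},
]
front = [c["name"] for c in pareto_front(candidates)]
```

Candidates A, B, and D each win on at least one objective against every rival, so none can be discarded without a trade-off; C loses on both objectives to A and to B and is dropped.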

[19]  arXiv:2111.15111 [pdf]
Title: Automatic tracing of mandibular canal pathways using deep learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

There is an increasing demand in the medical industry for automated detection and localization systems, since performing these tasks manually is inefficient. In dentistry, it is of great interest to trace the pathway of the mandibular canals accurately. Proper localization of the mandibular canals, which surround the inferior alveolar nerve (IAN), reduces the risk of damaging the nerve during dental implantology. Manual detection of canal paths is inefficient in terms of time and labor. Here, we propose a deep learning-based framework to detect mandibular canals from CBCT data. It is a fully automatic, end-to-end, three-stage process. Ground truths are generated in the preprocessing stage. Instead of using the commonly used fixed-diameter tubular-shaped ground truth, we generate centerlines of the mandibular canals and use them as ground truths in the training process. A 3D U-Net architecture is used for model training. An efficient post-processing stage is developed to rectify the initial prediction. The precision, recall, F1-score, and IoU are measured to analyze the voxel-level segmentation performance. To analyze distance-based performance, the mean curve distance (MCD) is calculated both from ground truth to prediction and from prediction to ground truth. Extensive experiments are conducted to demonstrate the effectiveness of the model.
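
The evaluation metrics named in this abstract can be illustrated with a small sketch. The voxel-level precision, recall, F1-score, and IoU follow their standard definitions. The abstract does not define the mean curve distance, so the version below assumes it is the mean Euclidean distance from each point on one curve to its nearest point on the other; the function names and the toy voxel sets are likewise illustrative assumptions.

```python
import math


def voxel_metrics(pred, truth):
    """Precision, recall, F1 and IoU for two sets of (x, y, z) voxel indices."""
    tp = len(pred & truth)   # voxels predicted and present in ground truth
    fp = len(pred - truth)   # voxels predicted but absent from ground truth
    fn = len(truth - pred)   # ground-truth voxels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou


def mean_curve_distance(src, dst):
    """Assumed MCD: mean distance from each src point to its nearest dst point."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


# Toy centerlines: the prediction misses one voxel and adds one spurious voxel.
truth = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)}
pred = {(1, 0, 0), (2, 0, 0), (3, 0, 0), (3, 1, 0)}

precision, recall, f1, iou = voxel_metrics(pred, truth)
mcd_truth_to_pred = mean_curve_distance(sorted(truth), sorted(pred))
mcd_pred_to_truth = mean_curve_distance(sorted(pred), sorted(truth))
```

Computing the distance in both directions matters because the measure is asymmetric in general: a prediction that covers only part of the canal can sit close to the ground truth everywhere while leaving ground-truth points far from any predicted point.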

[20]  arXiv:2111.15113 [pdf, other]
Title: LatentHuman: Shape-and-Pose Disentangled Latent Representation for Human Bodies
Comments: Accepted to 3DV 2021. Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

3D representation and reconstruction of human bodies have been studied for a long time in computer vision. Traditional methods rely mostly on parametric statistical linear models, limiting the space of possible bodies to linear combinations. It is only recently that some approaches try to leverage neural implicit representations for human body modeling, and while demonstrating impressive results, they are either limited by representation capability or not physically meaningful and controllable. In this work, we propose a novel neural implicit representation for the human body, which is fully differentiable and optimizable with disentangled shape and pose latent spaces. Contrary to prior work, our representation is designed based on the kinematic model, which makes the representation controllable for tasks like pose animation, while simultaneously allowing the optimization of shape and pose for tasks like 3D fitting and pose tracking. Our model can be trained and fine-tuned directly on non-watertight raw data with well-designed losses. Experiments demonstrate the improved 3D reconstruction performance over SoTA approaches and show the applicability of our method to shape interpolation, model fitting, pose tracking, and motion retargeting.

[21]  arXiv:2111.15114 [pdf, other]
Title: ePose: Let's Make EfficientPose More Generally Applicable
Comments: 7 pages, 8 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

EfficientPose is an impressive 3D object detection model. It has been demonstrated to be quick, scalable, and accurate, especially when considering that it uses only RGB inputs. In this paper we try to improve on EfficientPose by giving it the ability to infer an object's size, and by simplifying both the data collection and loss calculations. We evaluated ePose using the Linemod dataset and a new subset of it called "Occlusion 1-class". We also outline our current progress and thoughts about using ePose with the NuScenes and the 2017 KITTI 3D Object Detection datasets. The source code is available at https://github.com/tbd-clip/EfficientPose.

[22]  arXiv:2111.15119 [pdf, other]
Title: Aerial Images Meet Crowdsourced Trajectories: A New Approach to Robust Road Extraction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Land remote sensing analysis is a crucial research area in earth science. In this work, we focus on a challenging land analysis task, i.e., automatic extraction of traffic roads from remote sensing data, which has widespread applications in urban development and expansion estimation. However, conventional methods either utilize only the limited information of aerial images or simply fuse multimodal information (e.g., vehicle trajectories), and thus cannot reliably recognize unconstrained roads. To address this problem, we introduce a novel neural network framework termed Cross-Modal Message Propagation Network (CMMPNet), which fully exploits the complementary data of different modalities (i.e., aerial images and crowdsourced trajectories). Specifically, CMMPNet is composed of two deep Auto-Encoders for modality-specific representation learning and a tailor-designed Dual Enhancement Module for cross-modal representation refinement. In particular, the complementary information of each modality is comprehensively extracted and dynamically propagated to enhance the representation of the other modality. Extensive experiments on three real-world benchmarks demonstrate the effectiveness of our CMMPNet for robust road extraction, which benefits from blending data of different modalities, using either image and trajectory data or image and Lidar data. From the experimental results, we observe that the proposed approach outperforms current state-of-the-art methods by large margins.

[23]  arXiv:2111.15121 [pdf, other]
Title: Pyramid Adversarial Training Improves ViT Performance
Comments: 32 pages, including references & supplementary material
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Aggressive data augmentation is a key component of the strong generalization capabilities of Vision Transformer (ViT). One such data augmentation technique is adversarial training; however, many prior works have shown that this often results in poor clean accuracy. In this work, we present Pyramid Adversarial Training, a simple and effective technique to improve ViT's overall performance. We pair it with a "matched" Dropout and stochastic depth regularization, which adopts the same Dropout and stochastic depth configuration for the clean and adversarial samples. Similar to the improvements on CNNs by AdvProp (not directly applicable to ViT), our Pyramid Adversarial Training breaks the trade-off between in-distribution accuracy and out-of-distribution robustness for ViT and related architectures. It leads to $1.82\%$ absolute improvement on ImageNet clean accuracy for the ViT-B model when trained only on ImageNet-1K data, while simultaneously boosting performance on $7$ ImageNet robustness metrics, by absolute numbers ranging from $1.76\%$ to $11.45\%$. We set a new state-of-the-art for ImageNet-C (41.4 mCE), ImageNet-R ($53.92\%$), and ImageNet-Sketch ($41.04\%$) without extra data, using only the ViT-B/16 backbone and our Pyramid Adversarial Training. Our code will be publicly available upon acceptance.

[24]  arXiv:2111.15124 [pdf, other]
Title: In-Bed Human Pose Estimation from Unseen and Privacy-Preserving Image Domains
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Medical applications have benefited from the rapid advancement in computer vision. For patient monitoring in particular, in-bed human posture estimation provides important health-related metrics with potential value in medical condition assessments. Despite great progress in this domain, it remains a challenging task due to substantial ambiguity during occlusions, and the lack of large corpora of manually labeled data for model training, particularly with domains such as thermal infrared imaging which are privacy-preserving, and thus of great interest. Motivated by the effectiveness of self-supervised methods in learning features directly from data, we propose a multi-modal conditional variational autoencoder (MC-VAE) capable of reconstructing features from missing modalities seen during training. This approach is used with HRNet to enable single modality inference for in-bed pose estimation. Through extensive evaluations, we demonstrate that body positions can be effectively recognized from the available modality, achieving on par results with baseline models that are highly dependent on having access to multiple modes at inference time. The proposed framework supports future research towards self-supervised learning that generates a robust model from a single source, and expects it to generalize over many unknown distributions in clinical environments.

[25]  arXiv:2111.15127 [pdf, other]
Title: A Unified Pruning Framework for Vision Transformers
Authors: Hao Yu, Jianxin Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recently, the vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks. Yet the high computational costs and training data requirements of ViTs limit their application in resource-constrained settings. Model compression is an effective method to speed up deep learning models, but the compression of ViTs has been less explored. Many previous works concentrate on reducing the number of tokens. However, this line of attack breaks down the spatial structure of ViTs and is hard to generalize to downstream tasks. In this paper, we design a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs. Our method focuses on pruning all ViT components while maintaining the consistency of the model structure. Abundant experimental results show that our method can achieve high accuracy on compressed ViTs and variants, e.g., UP-DeiT-T achieves 75.79% accuracy on ImageNet, which outperforms the vanilla DeiT-T by 3.59% with the same computational cost. UP-PVTv2-B0 improves the accuracy of PVTv2-B0 by 4.83% for ImageNet classification. Meanwhile, UP-ViTs maintains the consistency of the token representation and gains consistent improvements on object detection tasks.

[26]  arXiv:2111.15129 [pdf, other]
Title: Anonymization for Skeleton Action Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Skeleton-based action recognition attracts practitioners and researchers due to the lightweight, compact nature of its datasets. Compared with RGB-video-based action recognition, skeleton-based action recognition is a safer way to protect the privacy of subjects while achieving competitive recognition performance. However, due to improvements in skeleton estimation algorithms as well as motion and depth sensors, more details of motion characteristics can be preserved in the skeleton dataset, leading to potential privacy leakage from the dataset. To investigate the potential privacy leakage from skeleton datasets, we first train a classifier to categorize sensitive private information from a trajectory of joints. Experiments show that classifiers trained on joint trajectories can predict gender with 88% accuracy and re-identify a person with 82% accuracy. We propose two variants of anonymization algorithms to protect against potential privacy leakage from the skeleton dataset. Experimental results show that the anonymized dataset can reduce the risk of privacy leakage while having marginal effects on the action recognition performance.

[27]  arXiv:2111.15140 [pdf, other]
Title: Robust 3D Garment Digitization from Monocular 2D Images for 3D Virtual Try-On Systems
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we develop a robust 3D garment digitization solution that generalizes well to real-world fashion catalog images with cloth texture occlusions and large body pose variations. We assume fixed-topology parametric template mesh models for known types of garments (e.g., T-shirts, trousers) and map high-quality texture from an input catalog image to the UV map panels corresponding to the parametric mesh model of the garment. We achieve this by first predicting a sparse set of 2D landmarks on the boundary of the garments. We then use these landmarks to perform Thin-Plate-Spline (TPS)-based texture transfer on the UV map panels. Subsequently, we employ a deep texture inpainting network to fill the large holes (due to view variations and self-occlusions) in the TPS output to generate consistent UV maps. Furthermore, to train the supervised deep networks for the landmark prediction and texture inpainting tasks, we generated a large set of synthetic data with varying texture and lighting, imaged from various views with the human present in a wide variety of poses. Additionally, we manually annotated a small set of fashion catalog images crawled from online fashion e-commerce platforms for fine-tuning. We conduct thorough empirical evaluations and show impressive qualitative results of our proposed 3D garment texture solution on fashion catalog images. Such 3D garment digitization helps us solve the challenging task of enabling 3D Virtual Try-on.

[28]  arXiv:2111.15143 [pdf, other]
Title: HEAT: Holistic Edge Attention Transformer for Structured Reconstruction
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper presents a novel attention-based neural network for structured reconstruction, which takes a 2D raster image as an input and reconstructs a planar graph depicting an underlying geometric structure. The approach detects corners and classifies edge candidates between corners in an end-to-end manner. Our contribution is a holistic edge classification architecture, which 1) initializes the feature of an edge candidate by a trigonometric positional encoding of its end-points; 2) fuses image feature to each edge candidate by deformable attention; 3) employs two weight-sharing Transformer decoders to learn holistic structural patterns over the graph edge candidates; and 4) is trained with a masked learning strategy. The corner detector is a variant of the edge classification architecture, adapted to operate on pixels as corner candidates. We conduct experiments on two structured reconstruction tasks: outdoor building architecture and indoor floorplan planar graph reconstruction. Extensive qualitative and quantitative evaluations demonstrate the superiority of our approach over the state of the art. We will share code and models.
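
As an illustration of step 1) in this abstract, the sketch below builds an initial edge feature from a generic sinusoidal (trigonometric) encoding of the two end-point coordinates. The number of frequencies, the base constant, and the function names are assumptions chosen for illustration. They are not taken from the paper, whose encoding may be parameterized differently.

```python
import math


def sinusoidal_encoding(value, num_freqs=4, base=10000.0):
    """Encode one scalar coordinate as [sin, cos] pairs at geometrically spaced frequencies."""
    feats = []
    for k in range(num_freqs):
        freq = 1.0 / (base ** (k / num_freqs))
        feats.append(math.sin(value * freq))
        feats.append(math.cos(value * freq))
    return feats


def edge_encoding(p, q, num_freqs=4):
    """Concatenate the encodings of both end-points (x1, y1, x2, y2) of an edge candidate."""
    feats = []
    for coord in (*p, *q):
        feats.extend(sinusoidal_encoding(coord, num_freqs))
    return feats


# An edge candidate between corners at pixel (12, 40) and pixel (96, 40).
feature = edge_encoding((12, 40), (96, 40))
```

With four frequencies per coordinate, each edge candidate receives a 32-dimensional initial feature (4 coordinates x 4 frequencies x 2 functions).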

[29]  arXiv:2111.15150 [pdf, other]
Title: AirObject: A Temporally Evolving Graph Embedding for Object Identification
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Object encoding and identification are vital for robotic tasks such as autonomous exploration, semantic scene understanding, and re-localization. Previous approaches have attempted to either track objects or generate descriptors for object identification. However, such systems are limited to a "fixed" partial object representation from a single viewpoint. In a robot exploration setup, there is a requirement for a temporally "evolving" global object representation built as the robot observes the object from multiple viewpoints. Furthermore, given the vast distribution of unknown novel objects in the real world, the object identification process must be class-agnostic. In this context, we propose a novel temporal 3D object encoding approach, dubbed AirObject, to obtain global keypoint graph-based embeddings of objects. Specifically, the global 3D object embeddings are generated using a temporal convolutional network across structural information of multiple frames obtained from a graph attention-based encoding method. We demonstrate that AirObject achieves the state-of-the-art performance for video object identification and is robust to severe occlusion, perceptual aliasing, viewpoint shift, deformation, and scale transform, outperforming the state-of-the-art single-frame and sequential descriptors. To the best of our knowledge, AirObject is one of the first temporal object encoding methods.

[30]  arXiv:2111.15157 [pdf, other]
Title: MMPTRACK: Large-scale Densely Annotated Multi-camera Multiple People Tracking Benchmark
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Multi-camera tracking systems are gaining popularity in applications that demand high-quality tracking results, such as frictionless checkout, because monocular multi-object tracking (MOT) systems often fail in cluttered and crowded environments due to occlusion. Multiple highly overlapped cameras can significantly alleviate the problem by recovering partial 3D information. However, the cost of creating a high-quality multi-camera tracking dataset with diverse camera settings and backgrounds has limited the dataset scale in this domain. In this paper, we provide a large-scale densely-labeled multi-camera tracking dataset in five different environments with the help of an auto-annotation system. The system uses overlapped and calibrated depth and RGB cameras to build a high-performance 3D tracker that automatically generates the 3D tracking results. The 3D tracking results are projected to each RGB camera view using camera parameters to create 2D tracking results. Then, we manually check and correct the 3D tracking results to ensure the label quality, which is much cheaper than fully manual annotation. We have conducted extensive experiments using two real-time multi-camera trackers and a person re-identification (ReID) model with different settings. This dataset provides a more reliable benchmark of multi-camera, multi-object tracking systems in cluttered and crowded environments. Also, our results demonstrate that adapting the trackers and ReID models on this dataset significantly improves their performance. Our dataset will be publicly released upon the acceptance of this work.
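
The projection of 3D tracking results into each RGB camera view can be sketched with the standard pinhole camera model. This is a minimal illustration that assumes an ideal camera without lens distortion. The function name `project_point` and the parameter layout are hypothetical and do not describe the authors' annotation system.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point into pixel coordinates with a pinhole model.

    rotation is a 3x3 row-major matrix and translation a 3-vector (world to camera);
    fx, fy, cx, cy are the intrinsic focal lengths and principal point in pixels.
    Returns None when the point is not in front of the camera.
    """
    # Camera coordinates: X_c = R * X_w + t
    cam = [
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if cam[2] <= 0:
        return None
    # Perspective division followed by the intrinsic mapping to pixels.
    u = fx * cam[0] / cam[2] + cx
    v = fy * cam[1] / cam[2] + cy
    return (u, v)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point((1.0, 2.0, 4.0), identity, (0.0, 0.0, 0.0), 100.0, 100.0, 320.0, 240.0)
```

Repeating the call with each camera's own extrinsic and intrinsic parameters yields one 2D location per view for the same 3D track.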

[31]  arXiv:2111.15158 [pdf, other]
Title: A Dataset-Dispersion Perspective on Reconstruction Versus Recognition in Single-View 3D Reconstruction Networks
Comments: Accepted to 3DV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Neural networks (NNs) for single-view 3D reconstruction (SVR) have gained in popularity. Recent work points out that for SVR, most cutting-edge NNs have limited performance on reconstructing unseen objects because they rely primarily on recognition (i.e., classification-based methods) rather than shape reconstruction. To understand this issue in depth, we provide a systematic study on when and why NNs prefer recognition to reconstruction and vice versa. Our findings show that a leading factor in determining recognition versus reconstruction is how dispersed the training data is. Thus, we introduce the dispersion score, a new data-driven metric, to quantify this leading factor and study its effect on NNs. We hypothesize that NNs are biased toward recognition when training images are more dispersed and training shapes are less dispersed. Our experiments on synthetic and benchmark datasets support this hypothesis and show that the dispersion score is effective. We show that the proposed metric is a principled way to analyze reconstruction quality and provides novel information in addition to the conventional reconstruction score.

[32]  arXiv:2111.15162 [pdf, other]
Title: CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning
Comments: 10 pages, 9 figures, 5 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

For video captioning, "pre-training and fine-tuning" has become a de facto paradigm, where ImageNet Pre-training (INP) is usually used to help encode the video content, and a task-oriented network is fine-tuned from scratch to cope with caption generation. Comparing INP with the recently proposed CLIP (Contrastive Language-Image Pre-training), this paper investigates the potential deficiencies of INP for video captioning and explores the key to generating accurate descriptions. Specifically, our empirical study on INP vs. CLIP shows that INP makes it difficult for video caption models to capture attribute semantics and leaves them sensitive to irrelevant background information. By contrast, CLIP's significant boost in caption quality highlights the importance of attribute-aware representation learning. We are thus motivated to introduce Dual Attribute Prediction, an auxiliary task requiring a video caption model to learn the correspondence between video content and attributes and the co-occurrence relations between attributes. Extensive experiments on benchmark datasets demonstrate that our approach enables better learning of attribute-aware representations, bringing consistent improvements on models with different architectures and decoding algorithms.

[33]  arXiv:2111.15171 [pdf, other]
Title: Generative Convolution Layer for Image Generation
Comments: Submitted to Neural Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper introduces a novel convolution method, called generative convolution (GConv), which is simple yet effective for improving generative adversarial network (GAN) performance. Unlike the standard convolution, GConv first selects useful kernels compatible with the given latent vector, and then linearly combines the selected kernels to make latent-specific kernels. Using the latent-specific kernels, the proposed method produces latent-specific features which encourage the generator to produce high-quality images. This approach is simple but surprisingly effective. First, the GAN performance is significantly improved at little additional hardware cost. Second, GConv can be applied to existing state-of-the-art generators without modifying the network architecture. To demonstrate the superiority of GConv, this paper provides extensive experiments using various standard datasets including CIFAR-10, CIFAR-100, LSUN-Church, CelebA, and tiny-ImageNet. Quantitative evaluations show that GConv significantly boosts the performance of unconditional and conditional GANs in terms of Inception score (IS) and Frechet inception distance (FID). For example, the proposed method improves the FID and IS scores on the tiny-ImageNet dataset from 35.13 to 29.76 and from 20.23 to 22.64, respectively.

[34]  arXiv:2111.15174 [pdf, other]
Title: CRIS: CLIP-Driven Referring Image Segmentation
Comments: 15 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties of text and image, it is challenging for a network to align text and pixel-level features well. Existing approaches use pretrained models to facilitate learning, yet transfer the language and vision knowledge from pretrained models separately, ignoring the multi-modal correspondence information. Inspired by the recent advance in Contrastive Language-Image Pretraining (CLIP), in this paper, we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer the multi-modal knowledge effectively, CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment. More specifically, we design a vision-language decoder to propagate fine-grained semantic information from textual representations to each pixel-level activation, which promotes consistency between the two modalities. In addition, we present text-to-pixel contrastive learning to explicitly enforce the text feature to be similar to the related pixel-level features and dissimilar to the irrelevant ones. The experimental results on three benchmark datasets demonstrate that our proposed framework significantly outperforms the state of the art without any post-processing. The code will be released.
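
The text-to-pixel contrastive idea can be sketched as a per-pixel binary objective: the text feature should score high against pixels inside the referred region and low against all others. The sketch below assumes a sigmoid over the dot product and a binary cross-entropy averaged over pixels. The function name `text_to_pixel_loss` and this exact loss form are illustrative assumptions, not the paper's confirmed formulation.

```python
import math


def text_to_pixel_loss(text_feat, pixel_feats, mask):
    """Mean binary cross-entropy between sigmoid(text . pixel) and the 0/1 region mask."""
    total = 0.0
    for feat, label in zip(pixel_feats, mask):
        score = sum(t * f for t, f in zip(text_feat, feat))
        prob = 1.0 / (1.0 + math.exp(-score))
        prob = min(max(prob, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
        total += -math.log(prob) if label else -math.log(1.0 - prob)
    return total / len(pixel_feats)


text = [1.0, 0.0]
pixels = [[5.0, 0.0], [-5.0, 0.0]]   # first pixel agrees with the text, second opposes it
aligned = text_to_pixel_loss(text, pixels, [1, 0])
misaligned = text_to_pixel_loss(text, pixels, [0, 1])
```

The loss is small when the mask agrees with the feature similarities and large when it contradicts them, which is the behaviour the alignment objective aims for.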

[35]  arXiv:2111.15181 [pdf, other]
Title: Zero-Shot Semantic Segmentation via Spatial and Multi-Scale Aware Visual Class Embedding
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Fully supervised semantic segmentation technologies have brought a paradigm shift in scene understanding. However, the burden of expensive labeling cost remains a challenge. To solve the cost problem, recent studies have proposed language-model-based zero-shot semantic segmentation (L-ZSSS) approaches. In this paper, we point out that L-ZSSS is limited in generalization, which is a key virtue of zero-shot learning. To tackle this limitation, we propose a language-model-free zero-shot semantic segmentation framework, the Spatial and Multi-scale aware Visual Class Embedding Network (SM-VCENet). Furthermore, leveraging a vision-oriented class embedding, SM-VCENet enriches the visual information of the class embedding through multi-scale attention and spatial attention. We also propose a novel benchmark (PASCAL2COCO) for zero-shot semantic segmentation, which provides generalization evaluation by domain adaptation and contains visually challenging samples. In experiments, our SM-VCENet outperforms the zero-shot semantic segmentation state of the art by a relative margin on the PASCAL-5i benchmark and shows generalization robustness on the PASCAL2COCO benchmark.

[36]  arXiv:2111.15185 [pdf, other]
Title: SamplingAug: On the Importance of Patch Sampling Augmentation for Single Image Super-Resolution
Comments: BMVC 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

With the development of Deep Neural Networks (DNNs), plenty of methods based on DNNs have been proposed for Single Image Super-Resolution (SISR). However, existing methods mostly train the DNNs on uniformly sampled LR-HR patch pairs, which makes them fail to fully exploit informative patches within the image. In this paper, we present a simple yet effective data augmentation method. We first devise a heuristic metric to evaluate the informative importance of each patch pair. In order to reduce the computational cost for all patch pairs, we further propose to optimize the calculation of our metric by integral image, achieving about two orders of magnitude speedup. The training patch pairs are sampled according to their informative importance with our method. Extensive experiments show our sampling augmentation can consistently improve the convergence and boost the performance of various SISR architectures, including EDSR, RCAN, RDN, SRCNN and ESPCN across different scaling factors (x2, x3, x4). Code is available at https://github.com/littlepure2333/SamplingAug
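
The speedup from an integral image comes from turning every patch sum into four table lookups. The sketch below assumes the per-pixel scores are already available as a 2D list. The paper's actual patch metric is not specified in this abstract, so `img` holds placeholder values and both function names are hypothetical.

```python
def integral_image(img):
    """Summed-area table with one row and column of zero padding at the top and left."""
    height, width = len(img), len(img[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def patch_sum(table, top, left, size):
    """Sum of the size x size patch whose top-left pixel is (top, left), in O(1)."""
    bottom, right = top + size, left + size
    return table[bottom][right] - table[top][right] - table[bottom][left] + table[top][left]


img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # placeholder per-pixel scores
table = integral_image(img)
upper_left = patch_sum(table, 0, 0, 2)
lower_right = patch_sum(table, 1, 1, 2)
```

Building the table costs one pass over the image, after which the cost of scoring a patch no longer depends on the patch size.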

[37]  arXiv:2111.15192 [pdf, other]
Title: PlantStereo: A Stereo Matching Benchmark for Plant Surface Dense Reconstruction
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Stereo matching is an important task in computer vision that has drawn tremendous research attention for decades. However, in terms of disparity accuracy, density, and data size, public stereo datasets struggle to meet the requirements of models. In this paper, we aim to address this gap between datasets and models and propose a large-scale stereo dataset with high-accuracy disparity ground truth named PlantStereo. We used a semi-automatic approach to construct the dataset: after camera calibration and image registration, high-accuracy disparity images can be obtained from the depth images. In total, PlantStereo contains 812 image pairs covering a diverse set of plants: spinach, tomato, pepper, and pumpkin. We first evaluated our PlantStereo dataset on four different stereo matching methods. Extensive experiments on different models and plants show that, compared with integer-accuracy ground truth, the high-accuracy disparity images provided by PlantStereo can remarkably improve the training of deep learning models. This paper provides a feasible and reliable method for dense reconstruction of plant surfaces. The PlantStereo dataset and related code are available at: https://www.github.com/wangqingyu985/PlantStereo
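
Converting a depth image into a disparity image usually relies on the rectified-stereo relation d = f * B / Z, with focal length f in pixels, baseline B, and depth Z. The sketch below applies that textbook relation per pixel. It assumes a rectified pair and depth in the same unit as the baseline, and it is not taken from the PlantStereo code. The function names are hypothetical.

```python
def depth_to_disparity(depth, focal_px, baseline):
    """Disparity in pixels for one depth value; None marks an invalid (non-positive) depth."""
    if depth <= 0:
        return None
    return focal_px * baseline / depth


def depth_image_to_disparity(depth_image, focal_px, baseline):
    """Apply the conversion to every pixel of a depth image given as a 2D list."""
    return [[depth_to_disparity(z, focal_px, baseline) for z in row] for row in depth_image]


# Hypothetical calibration values: 800 px focal length, 0.5 m baseline.
disparity = depth_image_to_disparity([[2.0, 4.0], [0.0, 8.0]], 800.0, 0.5)
```

Because the result is a floating-point value rather than a rounded integer, it retains the sub-pixel precision that the abstract contrasts with integer-accuracy ground truth.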

[38]  arXiv:2111.15193 [pdf, other]
Title: Shunted Self-Attention via Multi-Scale Token Aggregation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks, thanks to their competence in modeling long-range dependencies of image patches or tokens via self-attention. These models, however, usually assign similar receptive fields to every token feature within each layer. Such a constraint inevitably limits the ability of each self-attention layer to capture multi-scale features, thereby leading to performance degradation when handling images with multiple objects of different scales. To address this issue, we propose a novel and generic strategy, termed shunted self-attention (SSA), that allows ViTs to model attention at hybrid scales within each attention layer. The key idea of SSA is to inject heterogeneous receptive field sizes into tokens: before computing the self-attention matrix, it selectively merges tokens to represent larger object features while keeping certain tokens to preserve fine-grained features. This novel merging scheme enables the self-attention to learn relationships between objects of different sizes and simultaneously reduces the number of tokens and the computational cost. Extensive experiments across various tasks demonstrate the superiority of SSA. Specifically, the SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet with only half the model size and computation cost, and surpasses Focal Transformer by 1.3 mAP on COCO and 2.9 mIOU on ADE20K under similar parameter and computation cost. Code has been released at https://github.com/OliverRensu/Shunted-Transformer.

[39]  arXiv:2111.15199 [pdf, other]
Title: Semi-Supervised 3D Hand Shape and Pose Estimation with Label Propagation
Comments: DICTA 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

To obtain 3D annotations, we are restricted to controlled environments or synthetic datasets, leading us to 3D datasets with less generalizability to real-world scenarios. To tackle this issue in the context of semi-supervised 3D hand shape and pose estimation, we propose the Pose Alignment network to propagate 3D annotations from labelled frames to nearby unlabelled frames in sparsely annotated videos. We show that incorporating the alignment supervision on pairs of labelled-unlabelled frames allows us to improve the pose estimation accuracy. Besides, we show that the proposed Pose Alignment network can effectively propagate annotations on unseen sparsely labelled videos without fine-tuning.

[40]  arXiv:2111.15207 [pdf, other]
Title: NeeDrop: Self-supervised Shape Representation from Sparse Point Clouds using Needle Dropping
Comments: 22 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Machine Learning (cs.LG)

There has recently been growing interest in implicit shape representations. Contrary to explicit representations, they have no resolution limitations and they easily deal with a wide variety of surface topologies. To learn these implicit representations, current approaches rely on a certain level of shape supervision (e.g., inside/outside information or distance-to-shape knowledge), or at least require a dense point cloud (to approximate the distance-to-shape well enough). In contrast, we introduce NeeDrop, a self-supervised method for learning shape representations from possibly extremely sparse point clouds. As in Buffon's needle problem, we "drop" (sample) needles on the point cloud and consider that, statistically, close to the surface, the needle end points lie on opposite sides of the surface. No shape knowledge is required and the point cloud can be highly sparse, e.g., as lidar point clouds acquired by vehicles. Previous self-supervised shape representation approaches fail to produce good-quality results on this kind of data. We obtain quantitative results on par with existing supervised approaches on shape reconstruction datasets and show promising qualitative results on hard autonomous driving datasets such as KITTI.

[41]  arXiv:2111.15208 [pdf]
Title: HRNET: AI on Edge for mask detection and social distancing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

The purpose of the paper is to provide an innovative emerging-technology framework for communities to combat epidemic situations. The paper proposes a unique outbreak response system framework based on artificial intelligence and edge computing for citizen-centric services, to help track and trace people evading safety policies such as mask use and social distancing measures in public or workplace settings. The framework further provides implementation guidelines for industrial settings as well as for governance and contact tracing tasks. Its adoption will thus support smart city planning and development focused on citizen health systems, contributing to improved quality of life. The conceptual framework is validated through quantitative data analysis via secondary data collected from researchers' public websites, GitHub repositories, and renowned journals, and further benchmarking was conducted for experimental results in the Microsoft Azure cloud environment. The study includes selected AI models for benchmark analysis, which were assessed on performance and accuracy in an edge computing environment for large-scale societal settings. Overall, the YOLO model performs best on the object detection task and is fast enough for mask detection, while HRNetV2 performs best on the semantic segmentation problem applied to the social distancing task in the AI-Edge inferencing setup. The paper proposes a new Edge-AI algorithm for building technology-oriented solutions for detecting masks and social distance in human movement. The paper contributes to the technological advancement of artificial intelligence and edge computing applied to problems in society and healthcare systems. The framework further equips government agencies and system providers to design and construct technology-oriented models in community settings to increase the quality of life using emerging technologies in smart urban environments.

[42]  arXiv:2111.15210 [pdf, other]
Title: Point Cloud Instance Segmentation with Semi-supervised Bounding-Box Mining
Comments: IEEE Trans on Pattern Analysis and Machine Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Point cloud instance segmentation has achieved huge progress with the emergence of deep learning. However, these methods are usually data-hungry and require expensive and time-consuming dense point cloud annotations. Using unlabeled or weakly labeled data to alleviate the annotation cost remains little explored for this task. In this paper, we introduce the first semi-supervised point cloud instance segmentation framework (SPIB) using both labeled and unlabeled bounding boxes as supervision. To be specific, our SPIB architecture involves a two-stage learning procedure. For stage one, a bounding box proposal generation network is trained under a semi-supervised setting with perturbation consistency regularization (SPCR). The regularization works by enforcing an invariance of the bounding box predictions over different perturbations applied to the input point clouds, to provide self-supervision for network learning. For stage two, the bounding box proposals with SPCR are grouped into subsets, and the instance masks are mined inside each subset with a novel semantic propagation module and a property consistency graph module. Moreover, we introduce a novel occupancy ratio guided refinement module to refine the instance masks. Extensive experiments on the challenging ScanNet v2 dataset demonstrate that our method can achieve competitive performance compared with recent fully-supervised methods.

[43]  arXiv:2111.15213 [pdf, other]
Title: Using a GAN to Generate Adversarial Examples to Facial Image Recognition
Comments: 8 pages, to appear at the Media Watermarking, Security, and Forensics Conference at Electronic Imaging, January, 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Images posted online present a privacy concern in that they may be used as reference examples for a facial recognition system. Such abuse of images is in violation of privacy rights but is difficult to counter. It is well established that adversarial example images can be created for recognition systems which are based on deep neural networks. These adversarial examples can be used to disrupt the utility of the images as reference examples or training data. In this work we use a Generative Adversarial Network (GAN) to create adversarial examples to deceive facial recognition, and we achieve an acceptable success rate in fooling the face recognition. Our results reduce the training time for the GAN by removing the discriminator component. Furthermore, our results show knowledge distillation can be employed to drastically reduce the size of the resulting model without impacting performance, indicating that our contribution could run comfortably on a smartphone.

[44]  arXiv:2111.15234 [pdf, other]
Title: NeRFReN: Neural Radiance Fields with Reflections
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Neural Radiance Fields (NeRF) has achieved unprecedented view synthesis quality using coordinate-based neural scene representations. However, NeRF's view dependency can only handle simple reflections like highlights but cannot deal with complex reflections such as those from glass and mirrors. In these scenarios, NeRF models the virtual image as real geometries which leads to inaccurate depth estimation, and produces blurry renderings when the multi-view consistency is violated as the reflected objects may only be seen under some of the viewpoints. To overcome these issues, we introduce NeRFReN, which is built upon NeRF to model scenes with reflections. Specifically, we propose to split a scene into transmitted and reflected components, and model the two components with separate neural radiance fields. Considering that this decomposition is highly under-constrained, we exploit geometric priors and apply carefully-designed training strategies to achieve reasonable decomposition results. Experiments on various self-captured scenes show that our method achieves high-quality novel view synthesis and physically sound depth estimation results while enabling scene editing applications. Code and data will be released.

[45]  arXiv:2111.15242 [pdf, other]
Title: ConDA: Unsupervised Domain Adaptation for LiDAR Segmentation via Regularized Domain Concatenation
Comments: 12 pages, 7 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Transferring knowledge learned from the labeled source domain to the raw target domain for unsupervised domain adaptation (UDA) is essential to the scalable deployment of an autonomous driving system. State-of-the-art approaches in UDA often employ a key concept: utilize joint supervision signals from both the source domain (with ground-truth) and the target domain (with pseudo-labels) for self-training. In this work, we improve and extend on this aspect. We present ConDA, a concatenation-based domain adaptation framework for LiDAR semantic segmentation that: (1) constructs an intermediate domain consisting of fine-grained interchange signals from both source and target domains without destabilizing the semantic coherency of objects and background around the ego-vehicle; and (2) utilizes the intermediate domain for self-training. Additionally, to improve both the network training on the source domain and self-training on the intermediate domain, we propose an anti-aliasing regularizer and an entropy aggregator to reduce the detrimental effects of aliasing artifacts and noisy target predictions. Through extensive experiments, we demonstrate that ConDA is significantly more effective in mitigating the domain gap compared to prior arts.

[46]  arXiv:2111.15246 [pdf, other]
Title: Hallucinated Neural Radiance Fields in the Wild
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Neural Radiance Fields (NeRF) has recently gained popularity for its impressive novel view synthesis ability. This paper studies the problem of hallucinated NeRF: i.e. recovering a realistic NeRF at a different time of day from a group of tourism images. Existing solutions adopt NeRF with a controllable appearance embedding to render novel views under various conditions, but cannot render view-consistent images with an unseen appearance. To solve this problem, we present an end-to-end framework for constructing a hallucinated NeRF, dubbed as H-NeRF. Specifically, we propose an appearance hallucination module to handle time-varying appearances and transfer them to novel views. Considering the complex occlusions of tourism images, an anti-occlusion module is introduced to decompose the static subjects for visibility accurately. Experimental results on synthetic data and real tourism photo collections demonstrate that our method can not only hallucinate the desired appearances, but also render occlusion-free images from different views. The project and supplementary materials are available at https://rover-xingyu.github.io/H-NeRF/.

[47]  arXiv:2111.15257 [pdf, other]
Title: ARTSeg: Employing Attention for Thermal images Semantic Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Research advancements have enabled neural network algorithms to be deployed in autonomous vehicles to perceive their surroundings. The standard exteroceptive sensors used for perceiving the environment are cameras and Lidar, and the neural network algorithms developed for these sensors have provided the necessary solutions for the autonomous vehicle's perception. One major drawback of these exteroceptive sensors is their limited operability in adverse weather conditions, for instance, low illumination and night conditions. The usability and affordability of thermal cameras in the sensor suite of the autonomous vehicle provide the necessary improvement in the autonomous vehicle's perception in adverse weather conditions. Semantic understanding of the environment benefits robust perception and can be achieved by segmenting the different objects in the scene. In this work, we employ a thermal camera for semantic segmentation. We design an attention-based Recurrent Convolution Network (RCNN) encoder-decoder architecture named ARTSeg for thermal semantic segmentation. The main contribution of this work is the design of the encoder-decoder architecture, which employs RCNN units in each encoder and decoder block. Furthermore, additive attention is employed in the decoder module to retain high-resolution features and improve the localization of features. The efficacy of the proposed method is evaluated on an available public dataset, showing better performance than other state-of-the-art methods in mean intersection over union (IoU).

[48]  arXiv:2111.15263 [pdf, other]
Title: Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Linguistic knowledge has brought great benefits to scene text recognition by providing semantics to refine character sequences. However, since linguistic knowledge has been applied individually on the output sequence, previous methods have not fully utilized the semantics to understand visual clues for text recognition. This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN), that enables interactions between visual and semantic features for better recognition performances. Specifically, MATRN identifies visual and semantic feature pairs and encodes spatial information into semantic features. Based on the spatial encoding, visual and semantic features are enhanced by referring to related features in the other modality. Furthermore, MATRN stimulates combining semantic features into visual features by hiding visual clues related to the character in the training phase. Our experiments demonstrate that MATRN achieves state-of-the-art performances on seven benchmarks with large margins, while naive combinations of two modalities show marginal improvements. Further ablative studies prove the effectiveness of our proposed components. Our implementation will be publicly available.

[49]  arXiv:2111.15264 [pdf, other]
Title: EdiBERT, a generative model for image editing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Advances in computer vision are pushing the limits of image manipulation, with generative models sampling detailed images on various tasks. However, a specialized model is often developed and trained for each specific task, even though many image editing tasks share similarities. In denoising, inpainting, or image compositing, one always aims at generating a realistic image from a low-quality one. In this paper, we aim at making a step towards a unified approach for image editing. To do so, we propose EdiBERT, a bi-directional transformer trained in the discrete latent space built by a vector-quantized auto-encoder. We argue that such a bidirectional model is suited for image manipulation since any patch can be re-sampled conditionally to the whole image. Using this unique and straightforward training objective, we show that the resulting model matches state-of-the-art performances on a wide variety of tasks: image denoising, image completion, and image composition.

[50]  arXiv:2111.15266 [pdf, other]
Title: Two-stage Temporal Modelling Framework for Video-based Depression Recognition using Graph Representation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Video-based automatic depression analysis provides a fast, objective and repeatable self-assessment solution, which has been widely developed in recent years. While depression clues may be reflected by human facial behaviours of various temporal scales, most existing approaches focus on modelling depression from either short-term or video-level facial behaviours. To address this, we propose a two-stage framework that models depression severity from multi-scale short-term and video-level facial behaviours. The short-term depressive behaviour modelling stage first learns depression-related facial behavioural features from multiple short temporal scales using deep networks, where a Depression Feature Enhancement (DFE) module is proposed to enhance the depression-related clues for all temporal scales and remove non-depression noises. Then, the video-level depressive behaviour modelling stage proposes two novel graph encoding strategies, i.e., Sequential Graph Representation (SEG) and Spectral Graph Representation (SPG), to re-encode all short-term features of the target video into a video-level graph representation, summarizing depression-related multi-scale video-level temporal information. As a result, the produced graph representations predict depression severity using both short-term and long-term facial behaviour patterns. The experimental results on AVEC 2013 and AVEC 2014 datasets show that the proposed DFE module consistently enhanced the depression severity estimation performance for various CNN models, while the SPG is superior to other video-level modelling methods. More importantly, the result achieved for the proposed two-stage framework shows its promising and solid performance compared to widely-used one-stage modelling approaches.

[51]  arXiv:2111.15271 [pdf, other]
Title: Affect-DML: Context-Aware One-Shot Recognition of Human Affect using Deep Metric Learning
Comments: Accepted to IEEE International Conference on Automatic Face and Gesture Recognition 2021 (FG2021). Benchmark, models, and code are at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Human affect recognition is a well-established research area with numerous applications, e.g., in psychological care, but existing methods assume that all emotions-of-interest are given a priori as annotated training examples. However, the rising granularity and refinements of the human emotional spectrum through novel psychological theories and the increased consideration of emotions in context bring considerable pressure to data collection and labeling work. In this paper, we conceptualize one-shot recognition of emotions in context -- a new problem aimed at recognizing human affect states at a finer granularity from a single support sample. To address this challenging task, we follow the deep metric learning paradigm and introduce a multi-modal emotion embedding approach which minimizes the distance of the same-emotion embeddings by leveraging complementary information of human appearance and the semantic scene context obtained through a semantic segmentation network. All streams of our context-aware model are optimized jointly using weighted triplet loss and weighted cross entropy loss. We conduct thorough experiments on both categorical and numerical emotion recognition tasks of the Emotic dataset adapted to our one-shot recognition problem, revealing that categorizing human affect from a single example is a hard task. Still, all variants of our model clearly outperform the random baseline, while leveraging the semantic scene context consistently improves the learnt representations, setting state-of-the-art results in one-shot emotion recognition. To foster research of more universal representations of human affect states, we will make our benchmark and models publicly available to the community under https://github.com/KPeng9510/Affect-DML.
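
The joint objective mentioned above can be sketched as follows. This is a minimal illustration, not the released Affect-DML code: the Euclidean distance, the margin value, and the scalar weights `w_triplet` and `w_ce` are assumptions, since the abstract does not say how the two losses are weighted.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull same-emotion embeddings together, push different ones at least `margin` apart."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def cross_entropy(logits, target):
    """Numerically stable softmax cross entropy for a single sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def joint_loss(anchor, positive, negative, logits, target, w_triplet=1.0, w_ce=1.0):
    """Weighted sum of the two terms; the weights here are hypothetical scalars."""
    return (w_triplet * triplet_loss(anchor, positive, negative)
            + w_ce * cross_entropy(logits, target))
```

The triplet term is zero once the negative is farther from the anchor than the positive by at least the margin, so only hard or semi-hard triplets contribute gradient.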

[52]  arXiv:2111.15288 [pdf, other]
Title: Revisiting Temporal Alignment for Video Restoration
Comments: 15 pages, 17 figures, 10 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Long-range temporal alignment is critical yet challenging for video restoration tasks. Recently, some works attempt to divide the long-range alignment into several sub-alignments and handle them progressively. Although this operation is helpful in modeling distant correspondences, error accumulation is inevitable due to the propagation mechanism. In this work, we present a novel, generic iterative alignment module which employs a gradual refinement scheme for sub-alignments, yielding more accurate motion compensation. To further enhance the alignment accuracy and temporal consistency, we develop a non-parametric re-weighting method, where the importance of each neighboring frame is adaptively evaluated in a spatial-wise way for aggregation. By virtue of the proposed strategies, our model achieves state-of-the-art performance on multiple benchmarks across a range of video restoration tasks including video super-resolution, denoising and deblurring. Our project is available at https://github.com/redrock303/Revisiting-Temporal-Alignment-for-Video-Restoration.git.

[53]  arXiv:2111.15300 [pdf, other]
Title: TridentAdapt: Learning Domain-invariance via Source-Target Confrontation and Self-induced Cross-domain Augmentation
Comments: Accepted to BMVC2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Due to the difficulty of obtaining ground-truth labels, learning from virtual-world datasets is of great interest for real-world applications like semantic segmentation. From a domain adaptation perspective, the key challenge is to learn a domain-agnostic representation of the inputs in order to benefit from virtual data. In this paper, we propose a novel trident-like architecture that enforces a shared feature encoder to satisfy confrontational source and target constraints simultaneously, thus learning a domain-invariant feature space. Moreover, we also introduce a novel training pipeline enabling self-induced cross-domain data augmentation during the forward pass. This contributes to a further reduction of the domain gap. Combined with a self-training process, we obtain state-of-the-art results on benchmark datasets (e.g. GTA5 or Synthia to Cityscapes adaptation). Code and pre-trained models are available at https://github.com/HMRC-AEL/TridentAdapt

[54]  arXiv:2111.15318 [pdf, other]
Title: DiffSDFSim: Differentiable Rigid-Body Dynamics With Implicit Shapes
Comments: 22 pages, 23 Figures. Accepted for 3DV 2021. Project website: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)

Differentiable physics is a powerful tool in computer vision and robotics for scene understanding and reasoning about interactions. Existing approaches have frequently been limited to objects with simple shape or shapes that are known in advance. In this paper, we propose a novel approach to differentiable physics with frictional contacts which represents object shapes implicitly using signed distance fields (SDFs). Our simulation supports contact point calculation even when the involved shapes are nonconvex. Moreover, we propose ways for differentiating the dynamics for the object shape to facilitate shape optimization using gradient-based methods. In our experiments, we demonstrate that our approach allows for model-based inference of physical parameters such as friction coefficients, mass, forces or shape parameters from trajectory and depth image observations in several challenging synthetic scenarios and a real image sequence.

[55]  arXiv:2111.15340 [pdf, other]
Title: MC-SSL0.0: Towards Multi-Concept Self-Supervised Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Self-supervised pretraining is the method of choice for natural language processing models and is rapidly gaining popularity in many vision tasks. Recently, self-supervised pretraining has been shown to outperform supervised pretraining for many downstream vision applications, marking a milestone in the area. This superiority is attributed to the negative impact of incomplete labelling of the training images, which convey multiple concepts, but are annotated using a single dominant class label. Although Self-Supervised Learning (SSL), in principle, is free of this limitation, the choice of pretext task facilitating SSL is perpetuating this shortcoming by driving the learning process towards a single concept output. This study aims to investigate the possibility of modelling all the concepts present in an image without using labels. In this respect, the proposed SSL framework MC-SSL0.0 is a step towards Multi-Concept Self-Supervised Learning (MC-SSL) that goes beyond modelling a single dominant label in an image to effectively utilise the information from all the concepts present in it. MC-SSL0.0 consists of two core design concepts: group masked model learning, and learning of pseudo-concepts for data tokens using a momentum encoder (teacher-student) framework. The experimental results on multi-label and multi-class image classification downstream tasks demonstrate that MC-SSL0.0 not only surpasses existing SSL methods but also outperforms supervised transfer learning. The source code will be made publicly available for the community to train on bigger corpora.

[56]  arXiv:2111.15341 [pdf, ps, other]
Title: ZZ-Net: A Universal Rotation Equivariant Architecture for 2D Point Clouds
Comments: 9 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

In this paper, we are concerned with rotation equivariance on 2D point cloud data. We describe a particular set of functions able to approximate any continuous rotation equivariant and permutation invariant function. Based on this result, we propose a novel neural network architecture for processing 2D point clouds and we prove its universality for approximating functions exhibiting these symmetries.
We also show how to extend the architecture to accept a set of 2D-2D correspondences as input data, while maintaining similar equivariance properties. Experiments are presented on the estimation of essential matrices in stereo vision.
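
The two symmetries discussed in this abstract can be checked numerically. The sketch below is not ZZ-Net; it only tests whether a given function `f` on a 2D point cloud satisfies f(Rx) = R f(x) for a rotation R and is unchanged under reordering of the points. The centroid is used as an assumed example of a function with both properties, and all helper names are hypothetical.

```python
import math

def rotate(points, theta):
    """Rotate every 2D point counter-clockwise by `theta` radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def centroid(points):
    """Mean of the points: rotation equivariant and permutation invariant."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def close(p, q, tol=1e-9):
    return all(abs(a - b) <= tol for a, b in zip(p, q))

def is_rotation_equivariant(f, points, theta):
    """Check f(R x) == R f(x) for one rotation angle."""
    return close(f(rotate(points, theta)), rotate([f(points)], theta)[0])

def is_permutation_invariant(f, points):
    """Check that reversing the point order leaves f unchanged."""
    return close(f(points), f(list(reversed(points))))

cloud = [(1.0, 0.0), (0.0, 2.0), (-1.0, -1.0)]
```

A coordinate-wise maximum, by contrast, fails the equivariance check because it depends on the orientation of the axes.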

[57]  arXiv:2111.15361 [pdf, other]
Title: Seeking Salient Facial Regions for Cross-Database Micro-Expression Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

This paper focuses on cross-database micro-expression recognition, in which the training and test micro-expression samples belong to different micro-expression databases. Mismatched feature distributions between the training and testing micro-expression features degrade the performance of most well-performing micro-expression methods. To deal with cross-database micro-expression recognition, we propose a novel domain adaptation method called Transfer Group Sparse Regression (TGSR). TGSR learns a sparse regression matrix for selecting salient facial local regions and the corresponding relationship of the training set and test set. We evaluate our TGSR model on the CASME II and SMIC databases. Experimental results show that the proposed TGSR achieves satisfactory performance and outperforms most state-of-the-art subspace learning-based domain adaptation methods.

[58]  arXiv:2111.15362 [pdf, other]
Title: ISNAS-DIP: Image-Specific Neural Architecture Search for Deep Image Prior
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent works show that convolutional neural network (CNN) architectures have a spectral bias towards lower frequencies, which has been leveraged for various image restoration tasks in the Deep Image Prior (DIP) framework. The benefit of the inductive bias the network imposes in the DIP framework depends on the architecture. Therefore, researchers have studied how to automate the search to determine the best-performing model. However, common neural architecture search (NAS) techniques are resource and time-intensive. Moreover, best-performing models are determined for a whole dataset of images instead of for each image independently, which would be prohibitively expensive. In this work, we first show that optimal neural architectures in the DIP framework are image-dependent. Leveraging this insight, we then propose an image-specific NAS strategy for the DIP framework that requires substantially less training than typical NAS approaches, effectively enabling image-specific NAS. For a given image, noise is fed to a large set of untrained CNNs, and their outputs' power spectral densities (PSD) are compared to that of the corrupted image using various metrics. Based on this, a small cohort of image-specific architectures is chosen and trained to reconstruct the corrupted image. Among this cohort, the model whose reconstruction is closest to the average of the reconstructed images is chosen as the final model. We justify the proposed strategy's effectiveness by (1) demonstrating its performance on a NAS Dataset for DIP that includes 500+ models from a particular search space, and (2) conducting extensive experiments on image denoising, inpainting, and super-resolution tasks. Our experiments show that image-specific metrics can reduce the search space to a small cohort of models, of which the best model outperforms current NAS approaches for image restoration.
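
The two selection steps described above (ranking untrained networks by PSD similarity, then picking the reconstruction closest to the cohort average) can be sketched in a few lines. This is an illustration under several assumptions and not the ISNAS-DIP implementation: signals are 1D lists rather than images, the PSD comes from a naive DFT, the comparison metric is a plain L2 distance (the abstract only says "various metrics"), and all function names are hypothetical.

```python
import cmath

def psd(signal):
    """Power spectral density of a 1D signal via a naive O(n^2) DFT."""
    n = len(signal)
    out = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        out.append(abs(coeff) ** 2 / n)
    return out

def l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def rank_by_psd(outputs, corrupted, keep):
    """Keep the `keep` candidates whose output PSD is closest to the corrupted signal's PSD."""
    target = psd(corrupted)
    order = sorted(outputs, key=lambda name: l2(psd(outputs[name]), target))
    return order[:keep]

def closest_to_average(reconstructions):
    """Return the candidate whose reconstruction is nearest to the cohort mean."""
    names = list(reconstructions)
    length = len(reconstructions[names[0]])
    mean = [sum(reconstructions[n][i] for n in names) / len(names)
            for i in range(length)]
    return min(names, key=lambda n: l2(reconstructions[n], mean))
```

Neither step needs a ground-truth clean image, which is what makes the selection usable per image at test time.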

[59]  arXiv:2111.15363 [pdf, other]
Title: Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding
Comments: preprint
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Multi-view projection methods have demonstrated promising performance on 3D understanding tasks like 3D classification and segmentation. However, it remains unclear how to combine such multi-view methods with the widely available 3D point clouds. Previous methods use unlearned heuristics to combine features at the point level. To this end, we introduce the concept of the multi-view point cloud (Voint cloud), representing each 3D point as a set of features extracted from several view-points. This novel 3D Voint cloud representation combines the compactness of 3D point cloud representation with the natural view-awareness of multi-view representation. Naturally, we can equip this new representation with convolutional and pooling operations. We deploy a Voint neural network (VointNet) with a theoretically established functional form to learn representations in the Voint space. Our novel representation achieves state-of-the-art performance on 3D classification and retrieval on ScanObjectNN, ModelNet40, and ShapeNet Core55. Additionally, we achieve competitive performance for 3D semantic segmentation on ShapeNet Parts. Further analysis shows that VointNet improves the robustness to rotation and occlusion compared to other methods.

[60]  arXiv:2111.15376 [pdf]
Title: Reconstruction Student with Attention for Student-Teacher Pyramid Matching
Comments: 11 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Anomaly detection and localization are important problems in computer vision. Recently, Convolutional Neural Networks (CNNs) have been used for visual inspection. In particular, the scarcity of anomalous samples increases the difficulty of this task, and unsupervised learning-based methods are attracting attention. We focus on Student-Teacher Feature Pyramid Matching (STPM), which can be trained from only normal images in a small number of epochs. Here we propose a powerful method that compensates for the shortcomings of STPM. The proposed method consists of two students and two teachers, where one student-teacher pair is the same as in STPM. The other student-teacher network has the role of reconstructing the features of normal products. By reconstructing the features of normal products from an abnormal image and taking the difference between them, it is possible to detect abnormalities with higher accuracy. The new student-teacher network uses attention modules and a different teacher network from the original STPM. The attention mechanism helps to successfully reconstruct the normal regions in an input image. The different teacher network prevents the new pair from looking at the same regions as the original STPM. Six anomaly maps obtained from the two student-teacher networks are used to calculate the final anomaly map. The student-teacher network for reconstructing features improved the pixel-level and image-level AUC scores in comparison with the original STPM.

[61]  arXiv:2111.15400 [pdf, other]
Title: CT-block: a novel local and global features extractor for point cloud
Comments: 15 pages, 4 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep learning on point clouds is developing rapidly. Grouping each point with its neighbors and applying convolution-like operations to them can learn the local features of the point cloud, but this approach is weak at extracting long-distance global features. Applying an attention-based transformer to the whole point cloud can effectively learn its global features, but this approach struggles to extract detailed local features. In this paper, we propose a novel module, named CT-block, that can simultaneously extract and fuse local and global features. The CT-block is composed of two branches, where the letter C represents the convolution-branch and the letter T represents the transformer-branch. The convolution-branch performs convolution on the grouped neighbor points to extract the local feature. Meanwhile, the transformer-branch performs an offset-attention process on the whole point cloud to extract the global feature. Through the bridge constructed by the feature transmission element in the CT-block, the local and global features guide each other during learning and are fused effectively. We apply the CT-block to construct point cloud classification and segmentation networks, and evaluate their performance on several public datasets. The experimental results show that, because the features learned by the CT-block are highly expressive, the networks constructed with the CT-block achieve state-of-the-art performance on the point cloud classification and segmentation tasks.

[62]  arXiv:2111.15404 [pdf, other]
Title: Probabilistic Estimation of 3D Human Shape and Pose with a Semantic Local Parametric Model
Comments: BMVC 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper addresses the problem of 3D human body shape and pose estimation from RGB images. Some recent approaches to this task predict probability distributions over human body model parameters conditioned on the input images. This is motivated by the ill-posed nature of the problem wherein multiple 3D reconstructions may match the image evidence, particularly when some parts of the body are locally occluded. However, body shape parameters in widely-used body models (e.g. SMPL) control global deformations over the whole body surface. Distributions over these global shape parameters are unable to meaningfully capture uncertainty in shape estimates associated with locally-occluded body parts. In contrast, we present a method that (i) predicts distributions over local body shape in the form of semantic body measurements and (ii) uses a linear mapping to transform a local distribution over body measurements to a global distribution over SMPL shape parameters. We show that our method outperforms the current state-of-the-art in terms of identity-dependent body shape estimation accuracy on the SSP-3D dataset, and a private dataset of tape-measured humans, by probabilistically-combining local body measurement distributions predicted from multiple images of a subject.
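
Step (ii) above relies on a standard fact: a linear (affine) map of a Gaussian is again Gaussian. The sketch below assumes the local measurement distribution is Gaussian with mean `mean` and covariance `cov`, which the abstract does not state explicitly; the matrix `A` and `offset` are hypothetical placeholders for the learned measurement-to-SMPL mapping. Under that assumption the mapped distribution has mean A·mean + offset and covariance A·cov·Aᵀ.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def push_gaussian(mean, cov, A, offset):
    """Map N(mean, cov) through x -> A x + offset; returns the new mean and covariance."""
    new_mean = [sum(A[i][k] * mean[k] for k in range(len(mean))) + offset[i]
                for i in range(len(A))]
    new_cov = matmul(matmul(A, cov), transpose(A))
    return new_mean, new_cov

# Toy numbers: two local measurements mapped to two global shape coefficients.
shape_mean, shape_cov = push_gaussian(
    mean=[1, 2], cov=[[1, 0], [0, 4]], A=[[1, 1], [0, 2]], offset=[0, 1])
```

In the toy numbers, the large variance of the second measurement propagates into every shape coefficient it influences, which is how local uncertainty (for example from an occluded body part) would show up in the global parameters.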

[63]  arXiv:2111.15416 [pdf, other]
Title: A Face Recognition System's Worst Morph Nightmare, Theoretically
Subjects: Computer Vision and Pattern Recognition (cs.CV)

It has been shown that Face Recognition Systems (FRSs) are vulnerable to morphing attacks, but most research focusses on landmark-based morphs. A second method for generating morphs uses Generative Adversarial Networks, which results in convincingly real facial images that can be almost as challenging for FRSs as landmark-based attacks. We propose a method to create a third, different type of morph, that has the advantage of being easier to train. We introduce the theoretical concept of worst-case morphs, which are those morphs that are most challenging for a fixed FRS. For a set of images and corresponding embeddings in an FRS's latent space, we generate images that approximate these worst-case morphs using a mapping from embedding space back to image space. While the resulting images are not yet as challenging as other morphs, they can provide valuable information in future research on Morphing Attack Detection (MAD) methods and on weaknesses of FRSs. Methods for MAD need to be validated on more varied morph databases. Our proposed method contributes to achieving such variation.
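
One way to make the worst-case idea concrete in embedding space is sketched below. It rests on assumptions not stated in the abstract: embeddings are unit vectors, the FRS compares them by angular distance, and "most challenging" means minimising the larger of the two distances to the contributing identities. Under those assumptions, for two non-antipodal embeddings, the minimiser is the normalised midpoint. The mapping from this embedding back to an image is not shown.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def angle(u, v):
    """Angular distance in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))

def worst_case_embedding(e1, e2):
    """Unit vector minimising the larger angular distance to e1 and e2 (assumed non-antipodal)."""
    return normalize([a + b for a, b in zip(e1, e2)])

e1, e2 = [1.0, 0.0], [0.0, 1.0]
w = worst_case_embedding(e1, e2)
```

If the angle between `w` and each identity falls below the FRS decision threshold, an image with that embedding would be accepted as both identities.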

[64]  arXiv:2111.15430 [pdf, other]
Title: The Devil is in the Margin: Margin-based Label Smoothing for Network Calibration
Comments: 13 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

In spite of the dominant performances of deep neural networks, recent works have shown that they are poorly calibrated, resulting in over-confident predictions. Miscalibration can be exacerbated by overfitting due to the minimization of the cross-entropy during training, as it promotes the predicted softmax probabilities to match the one-hot label assignments. This yields a pre-softmax activation of the correct class that is significantly larger than the remaining activations. Recent evidence from the literature suggests that loss functions that embed implicit or explicit maximization of the entropy of predictions yield state-of-the-art calibration performances. We provide a unifying constrained-optimization perspective of current state-of-the-art calibration losses. Specifically, these losses could be viewed as approximations of a linear penalty (or a Lagrangian) imposing equality constraints on logit distances. This points to an important limitation of such underlying equality constraints, whose ensuing gradients constantly push towards a non-informative solution, which might prevent the model from reaching the best compromise between discriminative performance and calibration during gradient-based optimization. Following our observations, we propose a simple and flexible generalization based on inequality constraints, which imposes a controllable margin on logit distances. Comprehensive experiments on a variety of image classification, semantic segmentation and NLP benchmarks demonstrate that our method sets novel state-of-the-art results on these tasks in terms of network calibration, without affecting the discriminative performance. The code is available at https://github.com/by-liu/MbLS .
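
A minimal sketch of a margin penalty on logit distances follows. It is one plausible reading of the abstract, not code taken from the linked repository: the logit distance of class k is assumed to be max_j l_j − l_k, the penalty is assumed to be a hinge that activates only when a distance exceeds `margin`, and the `margin` and `weight` defaults are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Numerically stable softmax cross entropy for a single sample."""
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits)) - logits[target]

def margin_penalty(logits, margin):
    """Penalise only the logit distances (max logit minus each logit) above `margin`."""
    top = max(logits)
    return sum(max(0.0, top - l - margin) for l in logits)

def margin_loss(logits, target, margin=10.0, weight=0.1):
    """Cross entropy plus the weighted margin penalty (assumed combination)."""
    return cross_entropy(logits, target) + weight * margin_penalty(logits, margin)
```

With `margin=0` the penalty acts on every nonzero distance and keeps pushing all logits toward equality, which corresponds to the non-informative solution the abstract attributes to equality constraints; a positive margin switches the gradient off once distances are small enough.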

[65]  arXiv:2111.15438 [pdf, other]
Title: FMD-cGAN: Fast Motion Deblurring using Conditional Generative Adversarial Networks
Comments: International Conference on Computer Vision and Image Processing 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

In this paper, we present a Fast Motion Deblurring-Conditional Generative Adversarial Network (FMD-cGAN) that helps in blind motion deblurring of a single image. FMD-cGAN delivers impressive structural similarity and visual appearance after deblurring an image. Like other deep neural network architectures, GANs also suffer from large model size (parameters) and computations. It is not easy to deploy the model on resource-constrained devices such as mobile and robotic platforms. With the help of a MobileNet-based architecture that consists of depthwise separable convolutions, we reduce the model size and inference time without losing the quality of the images. More specifically, we reduce the model size by 3-60x compared to the nearest competitor. The resulting compressed deblurring cGAN is faster than its closest competitors, and its qualitative and quantitative results even outperform various recently proposed state-of-the-art blind motion deblurring models. We can also use our model for real-time image deblurring tasks. The current experiment on the standard datasets shows the effectiveness of the proposed method.
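
To make the size reduction from depthwise separable convolution concrete, the following sketch compares weight counts for a single layer. The layer dimensions are hypothetical and are not taken from the paper; biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in  # one kernel x kernel filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 128 input channels, 256 output channels.
    full = standard_conv_params(3, 128, 256)
    separable = depthwise_separable_params(3, 128, 256)
    print(full, separable, full / separable)
```

For this hypothetical layer the separable form needs 33,920 weights instead of 294,912, roughly a 8.7x reduction; the overall ratio for a full network depends on its architecture.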

[66]  arXiv:2111.15449 [pdf, ps, other]
Title: A Softmax-free Loss Function Based on Predefined Optimal-distribution of Latent Features for CNN Classifier
Authors: Qiuyu Zhu, Xuewen Zu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

In the field of pattern classification, the training of convolutional neural network classifiers is mostly end-to-end learning, and the loss function is the constraint on the final output (posterior probability) of the network, so the existence of Softmax is essential. In the case of end-to-end learning, there is usually no effective loss function that completely relies on the features of the middle layer to restrict learning. As a result, the distribution of sample latent features is not optimal, so there is still room for improvement in classification accuracy. Based on the concept of Predefined Evenly-Distributed Class Centroids (PEDCC), this article proposes a Softmax-free loss function (POD Loss) based on predefined optimal-distribution of latent features. The loss function only restricts the latent features of the samples, including the cosine distance between the latent feature vector of the sample and the center of the predefined evenly-distributed class, and the correlation between the latent features of the samples. Finally, cosine distance is used for classification. Compared with the commonly used Softmax Loss and the typical Softmax-related AM-Softmax Loss, COT-Loss and PEDCC-Loss, experiments on several commonly used datasets on a typical network show that the classification performance of POD Loss is always better and that it converges more easily. Code is available at https://github.com/TianYuZu/POD-Loss.
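
The cosine-distance part of this idea can be sketched as follows. This is not POD Loss itself: the correlation term is omitted, and the function names and the three 2D centroids used in the example are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance_loss(feature, centroid):
    """Training signal: zero when the feature points in the centroid's direction."""
    return 1.0 - cosine_similarity(feature, centroid)


def classify_by_cosine(feature, centroids):
    """Predict the class whose predefined centroid is most similar to the feature."""
    similarities = [cosine_similarity(feature, c) for c in centroids]
    return max(range(len(centroids)), key=lambda k: similarities[k])


if __name__ == "__main__":
    # Hypothetical centroids: three unit vectors spaced 120 degrees apart.
    centroids = [[1.0, 0.0], [-0.5, 0.866], [-0.5, -0.866]]
    print(classify_by_cosine([0.9, 0.1], centroids))
```

Because both training and prediction depend only on angles to fixed centroids, no Softmax layer is needed in this sketch.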

[67]  arXiv:2111.15451 [pdf, other]
Title: Large-Scale Video Analytics through Object-Level Consolidation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)

As the number of installed cameras grows, so do the compute resources required to process and analyze all the images captured by these cameras. Video analytics enables new use cases, such as smart cities or autonomous driving. At the same time, it urges service providers to install additional compute resources to cope with the demand while the strict latency requirements push compute towards the end of the network, forming a geographically distributed and heterogeneous set of compute locations, shared and resource-constrained. Such a landscape (shared and distributed locations) forces us to design new techniques that can optimize and distribute work among all available locations and, ideally, make compute requirements grow sublinearly with respect to the number of cameras installed. In this paper, we present FoMO (Focus on Moving Objects). This method effectively optimizes multi-camera deployments by preprocessing images for scenes, filtering the empty regions out, and composing regions of interest from multiple cameras into a single image that serves as input for a pre-trained object detection model. Results show that overall system performance can be increased by 8x while accuracy improves by 40% as a by-product of the methodology, all using an off-the-shelf pre-trained model with no additional training or fine-tuning.

[68]  arXiv:2111.15454 [pdf, other]
Title: Boosting Discriminative Visual Representation Learning with Scenario-Agnostic Mixup
Comments: Preprint under review. 8 pages main body, 6 pages appendix, 3 pages reference
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Mixup is a popular data-dependent augmentation technique for deep neural networks, which contains two sub-tasks, mixup generation and classification. The community typically confines mixup to supervised learning (SL) and the objective of the generation sub-task is fixed to the sampled pairs instead of considering the whole data manifold. To overcome such limitations, we systematically study the objectives of the two sub-tasks and propose Scenario-Agnostic Mixup for both SL and Self-supervised Learning (SSL) scenarios, named SAMix. Specifically, we hypothesize and verify the core objective of mixup generation as optimizing the local smoothness between two classes subject to global discrimination from other classes. Based on this discovery, $\eta$-balanced mixup loss is proposed for complementary training of the two sub-tasks. Meanwhile, the generation sub-task is parameterized as an optimizable module, Mixer, which utilizes an attention mechanism to generate mixed samples without label dependency. Extensive experiments on SL and SSL tasks demonstrate that SAMix consistently outperforms leading methods by a large margin.
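
For readers unfamiliar with the baseline, the sketch below shows standard (vanilla) mixup, in which a pair of samples and their one-hot labels are linearly interpolated. It is not SAMix and has no learned Mixer; the function name and the example values are assumptions. In practice the mixing ratio is usually drawn from a Beta distribution, for example with `random.betavariate`.

```python
def mixup(x_a, y_a, x_b, y_b, lam):
    """Linearly interpolate two samples and their one-hot labels with ratio lam."""
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return mixed_x, mixed_y


if __name__ == "__main__":
    # Two toy "images" flattened to two pixels each, with one-hot labels.
    x, y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 4.0], [0.0, 1.0], lam=0.25)
    print(x, y)
```

The mixed label keeps the same ratio as the mixed input, which is the fixed pairwise objective that the abstract contrasts with a learned generation sub-task.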

[69]  arXiv:2111.15463 [pdf, other]
Title: Consensus Synergizes with Memory: A Simple Approach for Anomaly Segmentation in Urban Scenes
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Anomaly segmentation is a crucial task for safety-critical applications, such as autonomous driving in urban scenes, where the goal is to detect out-of-distribution (OOD) objects with categories which are unseen during training. The core challenge of this task is how to distinguish hard in-distribution samples from OOD samples, which has not been explicitly discussed yet. In this paper, we propose a novel and simple approach named Consensus Synergizes with Memory (CosMe) to address this challenge, inspired by the psychological finding that groups perform better than individuals on memory tasks. The main idea is 1) building a memory bank which consists of seen prototypes extracted from multiple layers of the pre-trained segmentation model and 2) training an auxiliary model that mimics the behavior of the pre-trained model, and then measuring the consensus of their mid-level features as complementary cues that synergize with the memory bank. CosMe is good at distinguishing between hard in-distribution examples and OOD samples. Experimental results on several urban scene anomaly segmentation datasets show that CosMe outperforms previous approaches by large margins.

[70]  arXiv:2111.15475 [pdf]
Title: Natural Scene Text Editing Based on AI
Authors: Yujie Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In a recorded situation, textual information is crucial for scene interpretation and decision making. The ability to edit text directly on images has a number of advantages, including error correction, text restoration, and image reusability. This research shows how to change image text at the letter and digit level. I devised a two-part letters-digits network (LDN) to encode and decode digital images, as well as learn and transfer the font style of the source characters to the target characters. This method allows you to update the uppercase letters, lowercase letters and digits in the picture.

[71]  arXiv:2111.15479 [pdf]
Title: Analysis of Multiscale Wavelet-based Fractional Gradient-Anisotropic Diffusion Fusion for single hazy and underwater image enhancement
Authors: Uche A. Nnolim
Comments: 15 pages, 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This report presents the results of a multi-scale wavelet-based scheme for single image de-hazing and underwater image enhancement. The scheme is fast and highly localized in addition to global enhancement of hazy images. A PDE-based formulation enables additional versatility as the iterative nature allows more flexibility for various types of images. Visual and objective results from experiments indicate that the proposed approach competes favourably with or surpasses most of the state-of-the-art approaches.

[72]  arXiv:2111.15483 [pdf, other]
Title: Spatio-Temporal Multi-Flow Network for Video Frame Interpolation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Video frame interpolation (VFI) is currently a very active research topic, with applications spanning computer vision, post production and video encoding. VFI can be extremely challenging, particularly in sequences containing large motions, occlusions or dynamic textures, where existing approaches fail to offer perceptually robust interpolation performance. In this context, we present a novel deep learning based VFI method, ST-MFNet, based on a Spatio-Temporal Multi-Flow architecture. ST-MFNet employs a new multi-scale multi-flow predictor to estimate many-to-one intermediate flows, which are combined with conventional one-to-one optical flows to capture both large and complex motions. In order to enhance interpolation performance for various textures, a 3D CNN is also employed to model the content dynamics over an extended temporal window. Moreover, ST-MFNet has been trained within an ST-GAN framework, which was originally developed for texture synthesis, with the aim of further improving perceptual interpolation quality. Our approach has been comprehensively evaluated -- compared with fourteen state-of-the-art VFI algorithms -- clearly demonstrating that ST-MFNet consistently outperforms these benchmarks on varied and representative test datasets, with significant gains up to 1.09dB in PSNR for cases including large motions and dynamic textures. Project page: https://danielism97.github.io/ST-MFNet.
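
Since the reported gain is expressed in PSNR, a short sketch of how that metric is computed may help. The function name and the toy pixel values are assumptions; an 8-bit peak value of 255 is assumed by default.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


if __name__ == "__main__":
    print(psnr([100.0, 100.0], [110.0, 90.0]))
```

Because PSNR is logarithmic in the mean squared error, a gain of 1.09 dB corresponds to reducing the mean squared error by a factor of about 1.29.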

[73]  arXiv:2111.15490 [pdf, other]
Title: FENeRF: Face Editing in Neural Radiance Fields
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Previous portrait image generation methods roughly fall into two categories: 2D GANs and 3D-aware GANs. 2D GANs can generate high fidelity portraits but with low view consistency. 3D-aware GAN methods can maintain view consistency but their generated images are not locally editable. To overcome these limitations, we propose FENeRF, a 3D-aware generator that can produce view-consistent and locally-editable portrait images. Our method uses two decoupled latent codes to generate corresponding facial semantics and texture in a spatially aligned 3D volume with shared geometry. Benefiting from such underlying 3D representation, FENeRF can jointly render the boundary-aligned image and semantic mask and use the semantic mask to edit the 3D volume via GAN inversion. We further show such 3D representation can be learned from widely available monocular image and semantic mask pairs. Moreover, we reveal that jointly learning semantics and texture helps to generate finer geometry. Our experiments demonstrate that FENeRF outperforms state-of-the-art methods in various face editing tasks.

[74]  arXiv:2111.15491 [pdf, other]
Title: PolyWorld: Polygonal Building Extraction with Graph Neural Networks in Satellite Images
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Most state-of-the-art instance segmentation methods produce binary segmentation masks. However, geographic and cartographic applications typically require precise vector polygons of extracted objects instead of rasterized output. This paper introduces PolyWorld, a neural network that directly extracts building vertices from an image and connects them correctly to create precise polygons. The model predicts the connection strength between each pair of vertices using a graph neural network and estimates the assignments by solving a differentiable optimal transport problem. Moreover, the vertex positions are optimized by minimizing a combined segmentation and polygonal angle difference loss. PolyWorld significantly outperforms the state-of-the-art in building polygonization and not only achieves notable quantitative results, but also produces visually pleasing building polygons. Code and trained weights will soon be available on GitHub.
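
The abstract does not say which solver is used for the differentiable optimal transport step. One common choice is Sinkhorn normalization, sketched below under that assumption: a matrix of pairwise scores is exponentiated and then alternately normalized by rows and columns until it is approximately doubly stochastic. The function name, the iteration count, and the example scores are hypothetical.

```python
import math


def sinkhorn(scores, iterations=50):
    """Turn a square score matrix into an approximately doubly stochastic matrix."""
    plan = [[math.exp(s) for s in row] for row in scores]
    n_rows, n_cols = len(plan), len(plan[0])
    for _ in range(iterations):
        # Normalize every row to sum to one.
        for i in range(n_rows):
            total = sum(plan[i])
            plan[i] = [v / total for v in plan[i]]
        # Normalize every column to sum to one.
        for j in range(n_cols):
            total = sum(plan[i][j] for i in range(n_rows))
            for i in range(n_rows):
                plan[i][j] /= total
    return plan


if __name__ == "__main__":
    # Hypothetical connection scores between three vertices.
    scores = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.1, 0.4, 2.5]]
    for row in sinkhorn(scores):
        print([round(v, 3) for v in row])
```

Every operation here is a division or an exponential, so the same procedure stays differentiable when written in an automatic differentiation framework.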

[75]  arXiv:2111.15509 [pdf, other]
Title: Regularized directional representations for medical image registration
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In image registration, many efforts have been devoted to the development of alternatives to the popular normalized mutual information criterion. Concurrently to these efforts, an increasing number of works have demonstrated that substantial gains in registration accuracy can also be achieved by aligning structural representations of images rather than images themselves. Following this research path, we propose a new method for mono- and multimodal image registration based on the alignment of regularized vector fields derived from structural information such as gradient vector flow fields, a technique we call \textit{vector field similarity}. Our approach can be combined in a straightforward fashion with any existing registration framework by substituting vector field similarity to intensity-based registration. In our experiments, we show that the proposed approach compares favourably with conventional image alignment on several public image datasets using a diversity of imaging modalities and anatomical locations.

[76]  arXiv:2111.15510 [pdf, other]
Title: ESL: Event-based Structured Light
Journal-ref: International Conference on 3D Vision (3DV), Online, 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Event cameras are bio-inspired sensors providing significant advantages over standard cameras such as low latency, high temporal resolution, and high dynamic range. We propose a novel structured-light system using an event camera to tackle the problem of accurate and high-speed depth sensing. Our setup consists of an event camera and a laser-point projector that uniformly illuminates the scene in a raster scanning pattern during 16 ms. Previous methods match events independently of each other, and so they deliver noisy depth estimates at high scanning speeds in the presence of signal latency and jitter. In contrast, we optimize an energy function designed to exploit event correlations, called spatio-temporal consistency. The resulting method is robust to event jitter and therefore performs better at higher scanning speeds. Experiments demonstrate that our method can deal with high-speed motion and outperform state-of-the-art 3D reconstruction methods based on event cameras, reducing the RMSE by 83% on average, for the same acquisition time.

[77]  arXiv:2111.15513 [pdf, other]
Title: RADU: Ray-Aligned Depth Update Convolutions for ToF Data Denoising
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Time-of-Flight (ToF) cameras are subject to high levels of noise and distortions due to Multi-Path-Interference (MPI). While recent research showed that 2D neural networks are able to outperform previous traditional State-of-the-Art (SOTA) methods on denoising ToF-Data, little research on learning-based approaches has been done to make direct use of the 3D information present in depth images. In this paper, we propose an iterative denoising approach operating in 3D space, that is designed to learn on 2.5D data by enabling 3D point convolutions to correct the points' positions along the view direction. As labeled real world data is scarce for this task, we further train our network with a self-training approach on unlabeled real world data to account for real world statistics. We demonstrate that our method is able to outperform SOTA methods on several datasets, including two real world datasets and a new large-scale synthetic data set introduced in this paper.

[78]  arXiv:2111.15514 [pdf]
Title: Nonlinear Intensity Underwater Sonar Image Matching Method Based on Phase Information and Deep Convolution Features
Comments: 6 pages, letters, 9 figures. arXiv admin note: substantial text overlap with arXiv:2111.08994
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In the field of deep-sea exploration, sonar is presently the only efficient long-distance sensing device. The complicated underwater environment, such as noise interference, low target intensity or background dynamics, has brought many negative effects on sonar imaging. Among them, the problem of nonlinear intensity is extremely prevalent. It is also known as the anisotropy of acoustic sensor imaging, that is, when autonomous underwater vehicles (AUVs) carry sonar to detect the same target from different angles, the intensity variation between image pairs is sometimes very large, which makes the traditional matching algorithm almost ineffective. However, image matching is the basis of comprehensive tasks such as navigation, positioning, and mapping. Therefore, it is very valuable to obtain robust and accurate matching results. This paper proposes a combined matching method based on phase information and deep convolution features. It has two outstanding advantages: one is that the deep convolution features could be used to measure the similarity of the local and global positions of the sonar image; the other is that local feature matching could be performed at the key target position of the sonar image. This method does not need complex manual designs, and completes the matching task of nonlinear intensity sonar images in a close end-to-end manner. Feature matching experiments are carried out on the deep-sea sonar images captured by AUVs, and the results show that our proposal has preeminent matching accuracy and robustness.

[79]  arXiv:2111.15552 [pdf, other]
Title: NeuSample: Neural Sample Field for Efficient View Synthesis
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Neural radiance fields (NeRF) have shown great potential in representing 3D scenes and synthesizing novel views, but the computational overhead of NeRF at the inference stage is still heavy. To alleviate the burden, we delve into the coarse-to-fine, hierarchical sampling procedure of NeRF and point out that the coarse stage can be replaced by a lightweight module which we name a neural sample field. The proposed sample field maps rays into sample distributions, which can be transformed into point coordinates and fed into radiance fields for volume rendering. The overall framework is named NeuSample. We perform experiments on Realistic Synthetic 360$^{\circ}$ and Real Forward-Facing, two popular 3D scene sets, and show that NeuSample achieves better rendering quality than NeRF while enjoying a faster inference speed. NeuSample is further compressed with a proposed sample field extraction method towards a better trade-off between quality and speed.

[80]  arXiv:2111.15557 [pdf, other]
Title: Low-light Image Enhancement via Breaking Down the Darkness
Comments: 9 pages, 9 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Images captured in low-light environments often suffer from complex degradation. Simply adjusting light would inevitably result in a burst of hidden noise and color distortion. To seek results with satisfactory lighting, cleanliness, and realism from degraded inputs, this paper presents a novel framework inspired by the divide-and-rule principle, greatly alleviating the degradation entanglement. Assuming that an image can be decomposed into texture (with possible noise) and color components, one can specifically execute noise removal and color correction along with light adjustment. Towards this purpose, we propose to convert an image from the RGB space into a luminance-chrominance one. An adjustable noise suppression network is designed to eliminate noise in the brightened luminance, having the illumination map estimated to indicate noise boosting levels. The enhanced luminance further serves as guidance for the chrominance mapper to generate realistic colors. Extensive experiments are conducted to reveal the effectiveness of our design, and demonstrate its superiority over state-of-the-art alternatives both quantitatively and qualitatively on several benchmark datasets. Our code is publicly available at https://github.com/mingcv/Bread.
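
The abstract does not name the luminance-chrominance space. As an illustration only, the sketch below assumes the BT.601 YCbCr transform for RGB values in the range [0, 1]; the function name is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one RGB pixel in [0, 1] to (Y, Cb, Cr) using BT.601 weights."""
    y = 0.299 * r + 0.587 * g + 0.114 * b  # luminance: brightness and texture
    cb = (b - y) / 1.772                   # blue-difference chrominance
    cr = (r - y) / 1.402                   # red-difference chrominance
    return y, cb, cr


if __name__ == "__main__":
    print(rgb_to_ycbcr(0.5, 0.5, 0.5))  # a gray pixel has near-zero chrominance
    print(rgb_to_ycbcr(1.0, 0.0, 0.0))  # pure red
```

After such a conversion, brightening and denoising can operate on the luminance channel alone while color is handled separately in the two chrominance channels.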

[81]  arXiv:2111.15581 [pdf, other]
Title: Automated Damage Inspection of Power Transmission Towers from UAV Images
Comments: 8 pages, 10 figures, accepted for VISAPP 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Infrastructure inspection is a very costly task, requiring technicians to access remote or hard-to-reach places. This is the case for power transmission towers, which are sparsely located and require trained workers to climb them to search for damage. Recently, the use of drones or helicopters for remote recording is increasing in the industry, sparing the technicians this perilous task. This, however, leaves the problem of analyzing large numbers of images, which has great potential for automation. This is a challenging task for several reasons. First, the lack of freely available training data and the difficulty of collecting it complicate this problem. Additionally, the boundaries of what constitutes damage are fuzzy, introducing a degree of subjectivity in the labelling of the data. The unbalanced class distribution in the images also plays a role in increasing the difficulty of the task. This paper tackles the problem of structural damage detection in transmission towers, addressing these issues. Our main contributions are the development of a system for damage detection on remotely acquired drone images, applying techniques to overcome the issue of data scarcity and ambiguity, as well as the evaluation of the viability of such an approach to solve this particular problem.

[82]  arXiv:2111.15592 [pdf, other]
Title: MapReader: A Computer Vision Pipeline for the Semantic Exploration of Maps at Scale
Comments: 13 pages, 9 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)

We present MapReader, a free, open-source software library written in Python for analyzing large map collections (scanned or born-digital). This library transforms the way historians can use maps by turning extensive, homogeneous map sets into searchable primary sources. MapReader allows users with little or no computer vision expertise to i) retrieve maps via web-servers; ii) preprocess and divide them into patches; iii) annotate patches; iv) train, fine-tune, and evaluate deep neural network models; and v) create structured data about map content. We demonstrate how MapReader enables historians to interpret a collection of $\approx$16K nineteenth-century Ordnance Survey map sheets ($\approx$30.5M patches), foregrounding the challenge of translating visual markers into machine-readable data. We present a case study focusing on British rail infrastructure and buildings as depicted on these maps. We also show how the outputs from the MapReader pipeline can be linked to other, external datasets, which we use to evaluate as well as enrich and interpret the results. We release $\approx$62K manually annotated patches used here for training and evaluating the models.

[83]  arXiv:2111.15603 [pdf, other]
Title: Human Imperceptible Attacks and Applications to Improve Fairness
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Modern neural networks are able to perform at least as well as humans in numerous tasks involving object classification and image generation. However, small perturbations which are imperceptible to humans may significantly degrade the performance of well-trained deep neural networks. We provide a Distributionally Robust Optimization (DRO) framework which integrates human-based image quality assessment methods to design optimal attacks that are imperceptible to humans but significantly damaging to deep neural networks. Through extensive experiments, we show that our attack algorithm generates better-quality (less perceptible to humans) attacks than other state-of-the-art human imperceptible attack methods. Moreover, we demonstrate that DRO training using our optimally designed human imperceptible attacks can improve group fairness in image classification. Towards the end, we provide an algorithmic implementation to speed up DRO training significantly, which could be of independent interest.

[84]  arXiv:2111.15606 [pdf, other]
Title: Robust Partial-to-Partial Point Cloud Registration in a Full Range
Comments: 11 pages, 8 figures. Github Website: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Point cloud registration for 3D objects is very challenging due to sparse and noisy measurements, incomplete observations and large transformations. In this work, we propose Graph Matching Consensus Network (GMCNet), which estimates pose-invariant correspondences for full-range Partial-to-Partial point cloud Registration (PPR). To encode robust point descriptors, 1) we first comprehensively investigate the transformation-robustness and noise-resilience of various geometric features. 2) Then, we employ novel Transformation-robust Point Transformer (TPT) modules to adaptively aggregate local features regarding the structural relations, which take advantage of both handcrafted rotation-invariant ($RI$) features and noise-resilient spatial coordinates. 3) Based on a synergy of hierarchical graph networks and graphical modeling, we propose the Hierarchical Graphical Modeling (HGM) architecture to encode robust descriptors consisting of i) a unary term learned from $RI$ features; and ii) multiple smoothness terms encoded from neighboring point relations at different scales through our TPT modules. Moreover, we construct a challenging PPR dataset (MVP-RG) with virtual scans. Extensive experiments show that GMCNet outperforms previous state-of-the-art methods for PPR. Remarkably, GMCNet encodes point descriptors for each point cloud individually without using cross-contextual information, or ground truth correspondences for training. Our code and datasets will be available at https://github.com/paul007pl/GMCNet.

[85]  arXiv:2111.15613 [pdf, other]
Title: The MIS Check-Dam Dataset for Object Detection and Instance Segmentation Tasks
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep learning has led to many recent advances in object detection and instance segmentation, among other computer vision tasks. These advancements have led to wide application of deep learning based methods and related methodologies in object detection tasks for satellite imagery. In this paper, we introduce MIS Check-Dam, a new dataset of check-dams from satellite imagery for building an automated system for the detection and mapping of check-dams, focusing on the importance of irrigation structures used for agriculture. We review some of the most recent object detection and instance segmentation methods and assess their performance on our new dataset. We evaluate several single stage, two-stage and attention based methods under various network configurations and backbone architectures. The dataset and the pre-trained models are available at https://www.cse.iitb.ac.in/gramdrishti/.

[86]  arXiv:2111.15615 [pdf, other]
Title: Semi-Local Convolutions for LiDAR Scan Processing
Comments: arXiv admin note: text overlap with arXiv:2004.11803
Journal-ref: ICBINB Workshop at NeurIPS 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

A number of applications, such as mobile robots or automated vehicles, use LiDAR sensors to obtain detailed information about their three-dimensional surroundings. Many methods use image-like projections to efficiently process these LiDAR measurements and use deep convolutional neural networks to predict semantic classes for each point in the scan. The spatial stationary assumption enables the usage of convolutions. However, LiDAR scans exhibit large differences in appearance over the vertical axis. Therefore, we propose semi-local convolution (SLC), a convolution layer with a reduced amount of weight-sharing along the vertical dimension. We are the first to investigate the usage of such a layer independent of any other model changes. Our experiments did not show any improvement over traditional convolution layers in terms of segmentation IoU or accuracy.

[87]  arXiv:2111.15624 [pdf, other]
Title: Image Style Transfer and Content-Style Disentanglement
Comments: 10 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

We propose a way of learning disentangled content-style representations of images, allowing us to extrapolate images to any style as well as interpolate between any pair of styles. By augmenting the data set in a supervised setting and imposing a triplet loss, we ensure the separation of information encoded by the content and style representations. We also make use of a cycle-consistency loss to guarantee that images can be reconstructed faithfully from their representations.
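
As a reminder of what a triplet loss computes, here is a minimal sketch using Euclidean distance. The abstract does not specify the distance, the margin, or how triplets are formed, so those choices and the function names are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by at least margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    # Hypothetical style codes: the positive shares the anchor's style, the negative does not.
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
```

Applied to style codes, for example, such a loss pulls together images that share a style and pushes apart images that differ in style.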

[88]  arXiv:2111.15637 [pdf]
Title: BuildFormer: Automatic building extraction with vision transformer
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Building extraction from fine-resolution remote sensing images plays a vital role in numerous geospatial applications, such as urban planning, population statistics, economic assessment and disaster management. With the advancement of deep learning technology, deep convolutional neural networks (DCNNs) have dominated the automatic building extraction task for many years. However, the local property of DCNNs limits the extraction of global information, weakening the ability of the network to recognize building instances. Recently, the Transformer has become a hot topic in the computer vision domain and has achieved state-of-the-art performance in fundamental vision tasks, such as image classification, semantic segmentation and object detection. Inspired by this, in this paper, we propose a novel transformer-based network for extracting buildings from fine-resolution remote sensing images, namely BuildFormer. In comparison with ResNet, the proposed method achieves an improvement of 2% in mIoU on the WHU building dataset.
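
The mIoU metric quoted above can be computed as in the following sketch, which works on flattened per-pixel labels. The function name and the toy labels are assumptions; classes absent from both prediction and ground truth are skipped.

```python
def mean_iou(predicted, truth, num_classes):
    """Mean intersection-over-union across classes for flattened label lists."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(predicted, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(predicted, truth) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither list
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example: 1 = building, 0 = background.
    print(mean_iou([1, 1, 0, 0], [1, 0, 0, 0], num_classes=2))
```

In the toy example the building class scores 1/2 and the background class 2/3, giving a mean of about 0.583.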

[89]  arXiv:2111.15639 [pdf, other]
Title: DeDUCE: Generating Counterfactual Explanations Efficiently
Comments: Presented at the 1st Workshop on eXplainable AI approaches for debugging and diagnosis (XAI4Debugging@NeurIPS2021)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

When an image classifier outputs a wrong class label, it can be helpful to see what changes in the image would lead to a correct classification. This is the aim of algorithms generating counterfactual explanations. However, there is no easily scalable method to generate such counterfactuals. We develop a new algorithm providing counterfactual explanations for large image classifiers trained with spectral normalisation at low computational cost. We empirically compare this algorithm against baselines from the literature; our novel algorithm consistently finds counterfactuals that are much closer to the original inputs. At the same time, the realism of these counterfactuals is comparable to the baselines. The code for all experiments is available at https://github.com/benedikthoeltgen/DeDUCE.

[90]  arXiv:2111.15640 [pdf, other]
Title: Diffusion Autoencoders: Toward a Meaningful and Decodable Representation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Diffusion probabilistic models (DPMs) have achieved remarkable quality in image generation that rivals GANs'. But unlike GANs, DPMs use a set of latent variables that lack semantic meaning and cannot serve as a useful representation for other tasks. This paper explores the possibility of using DPMs for representation learning and seeks to extract a meaningful and decodable representation of an input image via autoencoding. Our key idea is to use a learnable encoder for discovering the high-level semantics, and a DPM as the decoder for modeling the remaining stochastic variations. Our method can encode any image into a two-part latent code, where the first part is semantically meaningful and linear, and the second part captures stochastic details, allowing near-exact reconstruction. This capability enables challenging applications that currently foil GAN-based methods, such as attribute manipulation on real images. We also show that this two-level encoding improves denoising efficiency and naturally facilitates various downstream tasks including few-shot conditional sampling.

[91]  arXiv:2111.15651 [pdf, other]
Title: Leveraging The Topological Consistencies of Learning in Deep Neural Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Recently, methods have been developed to accurately predict the testing performance of a Deep Neural Network (DNN) on a particular task, given statistics of its underlying topological structure. However, further leveraging this newly found insight for practical applications is intractable due to the high computational cost in terms of time and memory. In this work, we define a new class of topological features that accurately characterize the progress of learning while being quick to compute during running time. Additionally, our proposed topological features are readily equipped for backpropagation, meaning that they can be incorporated in end-to-end training. Our newly developed practical topological characterization of DNNs allows for an additional set of applications. We first show we can predict the performance of a DNN without a testing set and without the need for high-performance computing. We also demonstrate our topological characterization of DNNs is effective in estimating task similarity. Lastly, we show we can induce learning in DNNs by actively constraining the DNN's topological structure. This opens up new avenues in constricting the underlying structure of DNNs in a meta-learning framework.

[92]  arXiv:2111.15656 [pdf, other]
Title: Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)

3D object detection networks tend to be biased towards the data they are trained on. Evaluation on datasets captured in different locations, under different conditions or with different sensors than the training (source) data results in a drop in model performance due to the gap in distribution with the test (or target) data. Current methods for domain adaptation either assume access to source data during training, which may not be available due to privacy or memory concerns, or require a sequence of lidar frames as an input. We propose a single-frame approach for source-free, unsupervised domain adaptation of lidar-based 3D object detectors that uses class prototypes to mitigate the effect of pseudo-label noise. Addressing the limitations of traditional feature aggregation methods for prototype computation in the presence of noisy labels, we utilize a transformer module to identify outlier RoIs that correspond to incorrect, over-confident annotations, and compute an attentive class prototype. Under an iterative training strategy, the losses associated with noisy pseudo labels are down-weighted and thus refined in the process of self-training. To validate the effectiveness of our proposed approach, we examine the domain shift associated with networks trained on large, label-rich datasets (such as the Waymo Open Dataset and nuScenes) and evaluated on smaller, label-poor datasets (such as KITTI) and vice versa. We demonstrate our approach on two recent object detectors and achieve results that outperform other domain adaptation works.

[93]  arXiv:2111.15666 [pdf, other]
Title: HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing
Comments: Project page available at this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The inversion of real images into StyleGAN's latent space is a well-studied problem. Nevertheless, applying existing approaches to real-world scenarios remains an open challenge, due to an inherent trade-off between reconstruction and editability: latent space regions which can accurately represent real images typically suffer from degraded semantic control. Recent work proposes to mitigate this trade-off by fine-tuning the generator to add the target image to well-behaved, editable regions of the latent space. While promising, this fine-tuning scheme is impractical for prevalent use as it requires a lengthy training phase for each new image. In this work, we introduce this approach into the realm of encoder-based inversion. We propose HyperStyle, a hypernetwork that learns to modulate StyleGAN's weights to faithfully express a given image in editable regions of the latent space. A naive modulation approach would require training a hypernetwork with over three billion parameters. Through careful network design, we reduce this to be in line with existing encoders. HyperStyle yields reconstructions comparable to those of optimization techniques with the near real-time inference capabilities of encoders. Lastly, we demonstrate HyperStyle's effectiveness on several applications beyond the inversion task, including the editing of out-of-domain images which were never seen during training.

[94]  arXiv:2111.15667 [pdf, other]
Title: ATS: Adaptive Token Sampling For Efficient Vision Transformers
Subjects: Computer Vision and Pattern Recognition (cs.CV)

While state-of-the-art vision transformer models achieve promising results for image classification, they are computationally very expensive and require many GFLOPs. Although the GFLOPs of a vision transformer can be decreased by reducing the number of tokens in the network, there is no setting that is optimal for all input images. In this work, we therefore introduce a differentiable, parameter-free Adaptive Token Sampling (ATS) module, which can be plugged into any existing vision transformer architecture. ATS empowers vision transformers by scoring and adaptively sampling significant tokens. As a result, the number of tokens is no longer static but varies for each input image. By integrating ATS as an additional layer within current transformer blocks, we can convert them into much more efficient vision transformers with an adaptive number of tokens. Since ATS is a parameter-free module, it can be added to off-the-shelf pretrained vision transformers as a plug-and-play module, thus reducing their GFLOPs without any additional training. However, due to its differentiable design, one can also train a vision transformer equipped with ATS. We evaluate our module on the ImageNet dataset by adding it to multiple state-of-the-art vision transformers. Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by 37% while preserving the accuracy.
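
The abstract does not spell out how tokens are scored or sampled, so the following is only a rough sketch of one way score-based adaptive sampling could work: inverse-transform sampling over normalized token scores at fixed quantiles. The function name `sample_tokens`, the quantile scheme, and the toy scores are assumptions made for illustration and should not be read as the authors' procedure.

```python
from bisect import bisect_left


def sample_tokens(scores, num_samples):
    """Select token indices by inverse-transform sampling over token scores.

    Hypothetical illustration: `scores` are non-negative significance scores,
    one per token. Evenly spaced quantiles are mapped through the cumulative
    distribution, and duplicate hits are merged, so the number of kept tokens
    depends on how peaked the score distribution is.
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")

    # Cumulative distribution of the normalized scores.
    cdf = []
    running = 0.0
    for score in scores:
        running += score / total
        cdf.append(running)

    selected = set()
    for k in range(num_samples):
        quantile = (k + 0.5) / num_samples
        # First token whose cumulative mass reaches the quantile.
        index = min(bisect_left(cdf, quantile), len(cdf) - 1)
        selected.add(index)
    return sorted(selected)


# Flat scores keep every token; peaked scores collapse onto the dominant one.
print(sample_tokens([1, 1, 1, 1], 4))
print(sample_tokens([0.97, 0.01, 0.01, 0.01], 4))
```

In this sketch the sampling step has no learned parameters, which is consistent with the plug-and-play property described above, and the number of returned indices varies with the input scores.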

[95]  arXiv:2111.15668 [pdf, other]
Title: AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Built on top of self-attention mechanisms, vision transformers have demonstrated remarkable performance on a variety of vision tasks recently. While achieving excellent performance, they still require relatively intensive computational cost that scales up drastically as the numbers of patches, self-attention heads and transformer blocks increase. In this paper, we argue that due to the large variations among images, their need for modeling long-range dependencies between patches differs. To this end, we introduce AdaViT, an adaptive computation framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use throughout the backbone on a per-input basis, aiming to improve inference efficiency of vision transformers with a minimal drop of accuracy for image recognition. Optimized jointly with a transformer backbone in an end-to-end manner, a light-weight decision network is attached to the backbone to produce decisions on-the-fly. Extensive experiments on ImageNet demonstrate that our method obtains more than 2x improvement on efficiency compared to state-of-the-art vision transformers with only 0.8% drop of accuracy, achieving good efficiency/accuracy trade-offs conditioned on different computational budgets. We further conduct quantitative and qualitative analysis on learned usage policies and provide more insights on the redundancy in vision transformers.

[96]  arXiv:2111.15669 [pdf, other]
Title: 360MonoDepth: High-Resolution 360° Monocular Depth Estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

360° cameras can capture complete environments in a single shot, which makes 360° imagery alluring in many computer vision tasks. However, monocular depth estimation remains a challenge for 360° data, particularly for high resolutions like 2K (2048×1024) that are important for novel-view synthesis and virtual reality applications. Current CNN-based methods do not support such high resolutions due to limited GPU memory. In this work, we propose a flexible framework for monocular depth estimation from high-resolution 360° images using tangent images. We project the 360° input image onto a set of tangent planes that produce perspective views, which are suitable for the latest, most accurate state-of-the-art perspective monocular depth estimators. We recombine the individual depth estimates using deformable multi-scale alignment followed by gradient-domain blending to improve the consistency of disparity estimates. The result is a dense, high-resolution 360° depth map with a high level of detail, also for outdoor scenes which are not supported by existing methods.
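
Projecting a spherical image onto a tangent plane is commonly done with the gnomonic projection. As a minimal sketch, assuming an equirectangular input whose horizontal axis spans longitude [-π, π) and whose vertical axis spans latitude [π/2, -π/2], the code below maps a tangent-plane coordinate back to the source pixel it would sample. The function names and the pixel convention are illustrative assumptions, not details taken from the paper.

```python
import math


def tangent_to_sphere(x, y, lon0, lat0):
    """Inverse gnomonic projection.

    Maps a point (x, y) on the plane tangent to the unit sphere at
    (lon0, lat0) to spherical coordinates (lon, lat), all in radians.
    """
    rho = math.hypot(x, y)
    if rho == 0.0:
        return lon0, lat0  # the tangent point maps to itself
    c = math.atan(rho)  # angular distance from the tangent point
    sin_lat = math.cos(c) * math.sin(lat0) + y * math.sin(c) * math.cos(lat0) / rho
    lat = math.asin(max(-1.0, min(1.0, sin_lat)))  # clamp against rounding
    lon = lon0 + math.atan2(
        x * math.sin(c),
        rho * math.cos(lat0) * math.cos(c) - y * math.sin(lat0) * math.sin(c),
    )
    return lon, lat


def sphere_to_equirect_pixel(lon, lat, width, height):
    """Convert (lon, lat) in radians to equirectangular pixel coordinates."""
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v


# A point one unit to the right of a tangent point on the equator lies
# 45 degrees east of it on the sphere.
lon, lat = tangent_to_sphere(1.0, 0.0, 0.0, 0.0)
print(math.degrees(lon), math.degrees(lat))
print(sphere_to_equirect_pixel(lon, lat, 2048, 1024))
```

Evaluating this mapping for every pixel of a perspective view gives the lookup table needed to resample the 360° image into that view.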

[97]  arXiv:2111.15672 [pdf, other]
Title: Unsupervised Domain Adaptation: A Reality Check
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Interest in unsupervised domain adaptation (UDA) has surged in recent years, resulting in a plethora of new algorithms. However, as is often the case in fast-moving fields, baseline algorithms are not tested to the extent that they should be. Furthermore, little attention has been paid to validation methods, i.e. the methods for estimating the accuracy of a model in the absence of target domain labels. This is despite the fact that validation methods are a crucial component of any UDA train/val pipeline. In this paper, we show via large-scale experimentation that 1) in the oracle setting, the difference in accuracy between UDA algorithms is smaller than previously thought, 2) state-of-the-art validation methods are not well-correlated with accuracy, and 3) differences between UDA algorithms are dwarfed by the drop in accuracy caused by validation methods.

Cross-lists for Wed, 1 Dec 21

[98]  arXiv:2111.14843 (cross-list from cs.SD) [pdf, other]
Title: Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Audio and Speech Processing (eess.AS)

Audio-visual navigation combines sight and hearing to navigate to a sound-emitting source in an unmapped environment. While recent approaches have demonstrated the benefits of audio input to detect and find the goal, they focus on clean and static sound sources and struggle to generalize to unheard sounds. In this work, we propose the novel dynamic audio-visual navigation benchmark, which requires catching a moving sound source in an environment with noisy and distracting sounds. We introduce a reinforcement learning approach that learns a robust navigation policy for these complex settings. To achieve this, we propose an architecture that fuses audio-visual information in the spatial feature space to learn correlations of geometric information inherent in both local maps and audio signals. We demonstrate that our approach consistently outperforms the current state-of-the-art by a large margin across all tasks of moving sounds, unheard sounds, and noisy environments, on two challenging 3D scanned real-world environments, namely Matterport3D and Replica. The benchmark is available at this http URL

[99]  arXiv:2111.14934 (cross-list from cs.GR) [pdf, other]
Title: GAN-CNMP: An Interactive Generative Drawing Tool
Comments: 9 pages, 10 figures
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Sketches are abstract representations of visual perception and visuospatial construction. In this work, we proposed a new framework, GAN-CNMP, that incorporates a novel adversarial loss on CNMP to increase sketch smoothness and consistency. Through the experiments, we show that our model can be trained with few unlabeled samples, can construct distributions automatically in the latent space, and produces better results than the base model in terms of shape consistency and smoothness.

[100]  arXiv:2111.14953 (cross-list from eess.IV) [pdf, other]
Title: Localized Perturbations For Weakly-Supervised Segmentation of Glioma Brain Tumours
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Deep convolutional neural networks (CNNs) have become an essential tool in the medical imaging-based computer-aided diagnostic pipeline. However, training accurate and reliable CNNs requires large fine-grain annotated datasets. To alleviate this, weakly-supervised methods can be used to obtain local information from global labels. This work proposes the use of localized perturbations as a weakly-supervised solution to extract segmentation masks of brain tumours from a pretrained 3D classification model. Furthermore, we propose a novel optimal perturbation method that exploits 3D superpixels to find the most relevant area for a given classification using a U-net architecture. Our method achieved a Dice similarity coefficient (DSC) of 0.44 when compared with expert annotations. When compared against Grad-CAM, our method outperformed it in both visualization and localization ability of the tumour region, with Grad-CAM only achieving 0.11 average DSC.
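
For reference, the Dice similarity coefficient between a predicted mask P and a reference mask T is 2|P ∩ T| / (|P| + |T|). The snippet below is a small sketch of that standard definition on flattened binary masks; the function name and the toy masks are illustrative and unrelated to the paper's data.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient of two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention chosen here (an assumption): two empty masks match perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Two masks of three voxels each that overlap in two voxels: 2*2 / (3+3).
pred_mask = [0, 1, 1, 1, 0, 0]
true_mask = [0, 0, 1, 1, 1, 0]
print(dice_coefficient(pred_mask, true_mask))
```

A 3D volume can be scored the same way after flattening it to a single sequence of voxels.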

[101]  arXiv:2111.14959 (cross-list from eess.IV) [pdf, other]
Title: Improving the Segmentation of Pediatric Low-Grade Gliomas through Multitask Learning
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Brain tumor segmentation is a critical task for tumor volumetric analyses and AI algorithms. However, it is a time-consuming process and requires neuroradiology expertise. While there has been extensive research focused on optimizing brain tumor segmentation in the adult population, studies on AI-guided pediatric tumor segmentation are scarce. Furthermore, MRI signal characteristics of pediatric and adult brain tumors differ, necessitating the development of segmentation algorithms specifically designed for pediatric brain tumors. We developed a segmentation model trained on magnetic resonance imaging (MRI) of pediatric patients with low-grade gliomas (pLGGs) from The Hospital for Sick Children (Toronto, Ontario, Canada). The proposed model utilizes deep Multitask Learning (dMTL) by adding a classifier of the tumor's genetic alteration as an auxiliary task to the main network, ultimately improving the accuracy of the segmentation results.

[102]  arXiv:2111.15099 (cross-list from cs.LG) [pdf, other]
Title: Trust the Critics: Generatorless and Multipurpose WGANs with Initial Convergence Guarantees
Comments: 20 pages, 8 figures
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)

Inspired by ideas from optimal transport theory we present Trust the Critics (TTC), a new algorithm for generative modelling. This algorithm eliminates the trainable generator from a Wasserstein GAN; instead, it iteratively modifies the source data using gradient descent on a sequence of trained critic networks. This is motivated in part by the misalignment which we observed between the optimal transport directions provided by the gradients of the critic and the directions in which data points actually move when parametrized by a trainable generator. Previous work has arrived at similar ideas from different viewpoints, but our basis in optimal transport theory motivates the choice of an adaptive step size which greatly accelerates convergence compared to a constant step size. Using this step size rule, we prove an initial geometric convergence rate in the case of source distributions with densities. These convergence rates cease to apply only when a non-negligible set of generated data is essentially indistinguishable from real data. Resolving the misalignment issue improves performance, which we demonstrate in experiments that show that given a fixed number of training epochs, TTC produces higher quality images than a comparable WGAN, albeit at increased memory requirements. In addition, TTC provides an iterative formula for the transformed density, which traditional WGANs do not. Finally, TTC can be applied to map any source distribution onto any target; we demonstrate through experiments that TTC can obtain competitive performance in image generation, translation, and denoising without dedicated algorithms.

[103]  arXiv:2111.15133 (cross-list from cs.LG) [pdf, other]
Title: LossPlot: A Better Way to Visualize Loss Landscapes
Comments: 5 pages; 2 large figures
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Investigations into the loss landscapes of deep neural networks are often laborious. This work documents our user-driven approach to create a platform for semi-automating this process. LossPlot accepts data in the form of a CSV file, and allows multiple trained minimizers of the loss function to be manipulated in sync. Other features include a simple yet intuitive checkbox UI, summary statistics, and the ability to control clipping, which other methods do not offer.

[104]  arXiv:2111.15179 (cross-list from cs.LG) [pdf, other]
Title: A Highly Effective Low-Rank Compression of Deep Neural Networks with Modified Beam-Search and Modified Stable Rank
Comments: 8 pages, 8 figures, 2 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Compression has emerged as one of the essential deep learning research topics, especially for the edge devices that have limited computation power and storage capacity. Among the main compression techniques, low-rank compression via matrix factorization has been known to have two problems. First, an extensive tuning is required. Second, the resulting compression performance is typically not impressive. In this work, we propose a low-rank compression method that utilizes a modified beam-search for an automatic rank selection and a modified stable rank for a compression-friendly training. The resulting BSR (Beam-search and Stable Rank) algorithm requires only a single hyperparameter to be tuned for the desired compression ratio. The performance of BSR in terms of accuracy and compression ratio trade-off curve turns out to be superior to the previously known low-rank compression methods. Furthermore, BSR can perform on par with or better than the state-of-the-art structured pruning methods. As with pruning, BSR can be easily combined with quantization for an additional compression.
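
The abstract does not define the modified stable rank, so the sketch below only illustrates the standard stable rank of a matrix A, ||A||_F^2 / ||A||_2^2, which is at most the true rank and is small when a few singular values dominate. The spectral norm is estimated with plain power iteration; the function name, the iteration count, and the all-ones starting vector are assumptions chosen for illustration.

```python
import math


def stable_rank(matrix, iterations=200):
    """Standard stable rank: squared Frobenius norm over squared spectral norm.

    `matrix` is a list of equal-length rows. The largest eigenvalue of A^T A
    (the squared spectral norm) is estimated by power iteration, which assumes
    the all-ones start vector is not orthogonal to the top singular direction.
    """
    rows, cols = len(matrix), len(matrix[0])
    frob_sq = sum(x * x for row in matrix for x in row)
    if frob_sq == 0.0:
        return 0.0

    v = [1.0] * cols
    sigma_sq = 0.0
    for _ in range(iterations):
        # w = A v, then u = A^T w, i.e. u = (A^T A) v.
        w = [sum(row[j] * v[j] for j in range(cols)) for row in matrix]
        u = [sum(matrix[i][j] * w[i] for i in range(rows)) for j in range(cols)]
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_u == 0.0:
            break
        sigma_sq = norm_u / norm_v  # converges to the top eigenvalue of A^T A
        v = [x / norm_u for x in u]

    if sigma_sq == 0.0:
        raise ValueError("power iteration failed; try a different start vector")
    return frob_sq / sigma_sq


# The identity has full stable rank, while a rank-one matrix has stable rank 1.
print(stable_rank([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]))
print(stable_rank([[1.0, 1.0], [1.0, 1.0]]))
```

A weight matrix with low stable rank loses little when truncated to a few singular components, which is the general intuition behind encouraging low stable rank during training.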

[105]  arXiv:2111.15186 (cross-list from cs.LG) [pdf, other]
Title: Automatic Synthesis of Diverse Weak Supervision Sources for Behavior Analysis
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Obtaining annotations for large training sets is expensive, especially in behavior analysis settings where domain knowledge is required for accurate annotations. Weak supervision has been studied to reduce annotation costs by using weak labels from task-level labeling functions to augment ground truth labels. However, domain experts are still needed to hand-craft labeling functions for every studied task. To reduce expert effort, we present AutoSWAP: a framework for automatically synthesizing data-efficient task-level labeling functions. The key to our approach is to efficiently represent expert knowledge in a reusable domain specific language and domain-level labeling functions, with which we use state-of-the-art program synthesis techniques and a small labeled dataset to generate labeling functions. Additionally, we propose a novel structural diversity cost that allows for direct synthesis of diverse sets of labeling functions with minimal overhead, further improving labeling function data efficiency. We evaluate AutoSWAP in three behavior analysis domains and demonstrate that AutoSWAP outperforms existing approaches using only a fraction of the data. Our results suggest that AutoSWAP is an effective way to automatically generate labeling functions that can significantly reduce expert effort for behavior analysis.

[106]  arXiv:2111.15200 (cross-list from eess.IV) [pdf, other]
Title: Contrastive Learning for Local and Global Learning MRI Reconstruction
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Magnetic Resonance Imaging (MRI) is an important medical imaging modality, but it requires a long acquisition time. To reduce the acquisition time, various methods have been proposed. However, these methods fail to reconstruct images with a clear structure for two main reasons. Firstly, similar patches widely exist in MR images, while most previous deep learning-based methods ignore this property and only adopt CNNs to learn local information. Secondly, the existing methods only use clear images to constrain the upper bound of the solution space, while the lower bound is not constrained, so that a better parameter of the network cannot be obtained. To address these problems, we propose a Contrastive Learning for Local and Global Learning MRI Reconstruction Network (CLGNet). Specifically, according to Fourier theory, each value in the Fourier domain is calculated from all the values in the spatial domain. Therefore, we propose a Spatial and Fourier Layer (SFL) to simultaneously learn the local and global information in the spatial and Fourier domains. Moreover, compared with self-attention and transformers, the SFL has a stronger learning ability and can achieve better performance in less time. Based on the SFL, we design a Spatial and Fourier Residual block as the main component of our model. Meanwhile, to constrain the lower bound and upper bound of the solution space, we introduce contrastive learning, which can pull the result closer to the clear image and push the result further away from the undersampled image. Extensive experimental results on different datasets and acceleration rates demonstrate that the proposed CLGNet achieves new state-of-the-art results.
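
The statement that each Fourier-domain value depends on all spatial-domain values can be checked directly on a small one-dimensional signal. The snippet below is a toy illustration using the textbook discrete Fourier transform; it is not part of the proposed network, and the signal values are arbitrary.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform: X[k] = sum_t x[t] * exp(-2*pi*i*k*t/n)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


base = [1.0, 2.0, 3.0, 4.0]
perturbed = [1.0, 2.0, 3.5, 4.0]  # only the sample at index 2 is changed

# Every frequency coefficient shifts, because each one sums over all samples.
changed = [abs(a - b) > 1e-12 for a, b in zip(dft(base), dft(perturbed))]
print(changed)
```

This global dependence is why an operation applied pointwise in the Fourier domain has an image-wide receptive field in the spatial domain.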

[107]  arXiv:2111.15373 (cross-list from cs.RO) [pdf, other]
Title: ColibriDoc: An Eye-in-Hand Autonomous Trocar Docking System
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Retinal surgery is a complex medical procedure that requires exceptional expertise and dexterity. For this purpose, several robotic platforms are currently being developed to enable or improve the outcome of microsurgical tasks. Since the control of such robots is often designed for navigation inside the eye in proximity to the retina, successful trocar docking and insertion of the instrument into the eye represents an additional cognitive effort, and is, therefore, one of the open challenges in robotic retinal surgery. To address this challenge, we present a platform for autonomous trocar docking that combines computer vision and a robotic setup. Inspired by the Cuban Colibri (hummingbird) aligning its beak to a flower using only vision, we mount a camera onto the end effector of a robotic system. By estimating the position and pose of the trocar, the robot is able to autonomously align and navigate the instrument towards the Trocar's Entry Point (TEP) and finally perform the insertion. Our experiments show that the proposed method is able to accurately estimate the position and pose of the trocar and achieve repeatable autonomous docking. The aim of this work is to reduce the complexity of robotic setup preparation prior to the surgical task and therefore increase the intuitiveness of the system integration into the clinical workflow.

[108]  arXiv:2111.15409 (cross-list from eess.IV) [pdf, other]
Title: Fully Automatic Deep Learning Framework for Pancreatic Ductal Adenocarcinoma Detection on Computed Tomography
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Early detection improves prognosis in pancreatic ductal adenocarcinoma (PDAC) but is challenging as lesions are often small and poorly defined on contrast-enhanced computed tomography scans (CE-CT). Deep learning can facilitate PDAC diagnosis; however, current models still fail to identify small (<2cm) lesions. In this study, state-of-the-art deep learning models were used to develop an automatic framework for PDAC detection, focusing on small lesions. Additionally, the impact of integrating surrounding anatomy was investigated. CE-CT scans from a cohort of 119 pathology-proven PDAC patients and a cohort of 123 patients without PDAC were used to train a nnUnet for automatic lesion detection and segmentation (nnUnet_T). Two additional nnUnets were trained to investigate the impact of anatomy integration: (1) segmenting the pancreas and tumor (nnUnet_TP), (2) segmenting the pancreas, tumor, and multiple surrounding anatomical structures (nnUnet_MS). An external, publicly available test set was used to compare the performance of the three networks. The nnUnet_MS achieved the best performance, with an area under the receiver operating characteristic curve of 0.91 for the whole test set and 0.88 for tumors <2cm, showing that state-of-the-art deep learning can detect small PDAC and benefits from anatomy information.

[109]  arXiv:2111.15498 (cross-list from eess.IV) [pdf]
Title: Assessment of Data Consistency through Cascades of Independently Recurrent Inference Machines for fast and robust accelerated MRI reconstruction
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)

Interpretability and robustness are imperative for integrating Machine Learning methods for accelerated Magnetic Resonance Imaging (MRI) reconstruction in clinical applications. Doing so would allow fast high-quality imaging of anatomy and pathology. Data Consistency (DC) is crucial for generalization in multi-modal data and robustness in detecting pathology. This work proposes the Cascades of Independently Recurrent Inference Machines (CIRIM) to assess DC through unrolled optimization, implicitly by gradient descent and explicitly by a designed term. We perform extensive comparison of the CIRIM to other unrolled optimization methods, namely the End-to-End Variational Network (E2EVN) and the RIM, and to the UNet and Compressed Sensing (CS). Evaluation is done in two stages. Firstly, learning on multiple trained MRI modalities is assessed, i.e., brain data with T1-weighting and FLAIR contrast, and T2-weighted knee data. Secondly, robustness is tested on reconstructing pathology through white matter lesions in 3D FLAIR MRI data of relapsing remitting Multiple Sclerosis (MS) patients. Results show that the CIRIM performs best when implicitly enforcing DC, while the E2EVN requires explicitly formulated DC. The CIRIM shows the highest lesion contrast resolution in reconstructing the clinical MS data. Performance improves by approximately 11% compared to CS, while the reconstruction time is reduced by a factor of twenty.

[110]  arXiv:2111.15519 (cross-list from eess.IV) [pdf, other]
Title: Gram Barcodes for Histopathology Tissue Texture Retrieval
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Recent advances in digital pathology have led to the need for Histopathology Image Retrieval (HIR) systems that search through databases of biopsy images to find similar cases to a given query image. These HIR systems allow pathologists to effortlessly and efficiently access thousands of previously diagnosed cases in order to exploit the knowledge in the corresponding pathology reports. Since HIR systems may have to deal with millions of gigapixel images, the extraction of compact and expressive image features must be available to allow for efficient and accurate retrieval. In this paper, we propose the application of Gram barcodes as image features for HIR systems. Unlike most feature generation schemes, Gram barcodes are based on high-order statistics that describe tissue texture by summarizing the correlations between different feature maps in layers of convolutional neural networks. We run HIR experiments on three public datasets using a pre-trained VGG19 network for Gram barcode generation and showcase highly competitive results.
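
The correlations between feature maps mentioned above are conventionally collected in a Gram matrix, where entry (i, j) is the inner product of channels i and j taken over all spatial positions. The sketch below computes that matrix for a toy activation tensor; how the matrix is then turned into a barcode is not described in the abstract and is left out. The function name and the example values are hypothetical.

```python
def gram_matrix(feature_maps):
    """Gram matrix of a layer's activations.

    `feature_maps` is a list of C channels, each a flat list of H*W activations.
    Entry (i, j) of the result is the inner product of channels i and j.
    """
    length = len(feature_maps[0])
    if any(len(channel) != length for channel in feature_maps):
        raise ValueError("all channels must have the same number of activations")
    return [
        [sum(a * b for a, b in zip(ch_i, ch_j)) for ch_j in feature_maps]
        for ch_i in feature_maps
    ]


# Two channels with three spatial positions each (toy values).
features = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
]
print(gram_matrix(features))
```

Because the spatial dimension is summed out, the result is a C×C matrix whose size does not depend on the image resolution, which suits it to compact texture descriptors.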

[111]  arXiv:2111.15542 (cross-list from cs.LG) [pdf, other]
Title: Learning to Transfer for Traffic Forecasting via Multi-task Learning
Authors: Yichao Lu
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Deep neural networks have demonstrated superior performance in short-term traffic forecasting. However, most existing traffic forecasting systems assume that the training and testing data are drawn from the same underlying distribution, which limits their practical applicability. The NeurIPS 2021 Traffic4cast challenge is the first of its kind dedicated to benchmarking the robustness of traffic forecasting models towards domain shifts in space and time. This technical report describes our solution to this challenge. In particular, we present a multi-task learning framework for temporal and spatio-temporal domain adaptation of traffic forecasting models. Experimental results demonstrate that our multi-task learning approach achieves strong empirical performance, outperforming a number of baseline domain adaptation methods, while remaining highly efficient. The source code for this technical report is available at https://github.com/YichaoLu/Traffic4cast2021.

[112]  arXiv:2111.15646 (cross-list from cs.LG) [pdf, other]
Title: Exponentially Tilted Gaussian Prior for Variational Autoencoder
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

An important property for deep neural networks to possess is the ability to perform robust out-of-distribution (OOD) detection on previously unseen data. This property is essential for safety purposes when deploying models for real-world applications. Recent studies show that probabilistic generative models can perform poorly on this task, which is surprising given that they seek to estimate the likelihood of training data. To alleviate this issue, we propose the exponentially tilted Gaussian prior distribution for the Variational Autoencoder (VAE). With this prior, we are able to achieve state-of-the-art results using just the negative log likelihood that the VAE naturally assigns, while being orders of magnitude faster than some competitive methods. We also show that our model produces high-quality image samples which are crisper than those of a standard Gaussian VAE. The new prior distribution has a very simple implementation which uses a Kullback-Leibler divergence that compares the difference between a latent vector's length and the radius of a sphere.

Replacements for Wed, 1 Dec 21

[113]  arXiv:2005.08551 (replaced) [pdf, other]
Title: Omni-supervised Facial Expression Recognition via Distilled Data
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[114]  arXiv:2009.07734 (replaced) [pdf, other]
Title: TreeGAN: Incorporating Class Hierarchy into Image Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[115]  arXiv:2009.08020 (replaced) [pdf, other]
Title: LDNet: End-to-End Lane Marking Detection Approach Using a Dynamic Vision Sensor
Authors: Farzeen Munir (Student Member, IEEE), Shoaib Azam (Student Member, IEEE), Moongu Jeon (Senior Member, IEEE), Byung-Geun Lee (Member, IEEE), Witold Pedrycz (Life Fellow, IEEE)
Journal-ref: Munir, Farzeen, Shoaib Azam, Moongu Jeon, Byung-Geun Lee, and Witold Pedrycz. "LDNet: End-to-End Lane Marking Detection Approach Using a Dynamic Vision Sensor." IEEE Transactions on Intelligent Transportation Systems (2021)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[116]  arXiv:2101.00062 (replaced) [pdf, other]
Title: FGF-GAN: A Lightweight Generative Adversarial Network for Pansharpening via Fast Guided Filter
Comments: Accepted by ICME 2021 (Oral)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[117]  arXiv:2101.03531 (replaced) [pdf, other]
Title: Channel Boosting Feature Ensemble for Radar-based Object Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[118]  arXiv:2101.10837 (replaced) [pdf, other]
Title: The Ikshana Hypothesis of Human Scene Understanding
Comments: 22 pages, 7 figures, Technical report
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[119]  arXiv:2101.11986 (replaced) [pdf, other]
Title: Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Comments: ICCV 2021, codes: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[120]  arXiv:2103.03150 (replaced) [pdf, other]
Title: SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[121]  arXiv:2103.11093 (replaced) [pdf, other]
Title: Exploring The Effect of High-frequency Components in GANs Training
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[122]  arXiv:2103.13998 (replaced) [pdf, other]
Title: GridDehazeNet+: An Enhanced Multi-Scale Network with Intra-Task Knowledge Transfer for Single Image Dehazing
Comments: arXiv admin note: text overlap with arXiv:1908.03245
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[123]  arXiv:2104.02527 (replaced) [pdf, other]
Title: Vote from the Center: 6 DoF Pose Estimation in RGB-D Images by Radial Keypoint Voting
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[124]  arXiv:2104.04532 (replaced) [pdf, other]
Title: Neural RGB-D Surface Reconstruction
Comments: Project page: this https URL Video: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[125]  arXiv:2104.06977 (replaced) [pdf, other]
Title: Discrete Cosine Transform Network for Guided Depth Map Super-Resolution
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[126]  arXiv:2104.14042 (replaced) [pdf, other]
Title: Weather and Light Level Classification for Autonomous Driving: Dataset, Baseline and Active Learning
Comments: Accepted for Oral Presentation at IEEE Intelligent Transportation Systems Conference (ITSC) 2021. Dataset is released in this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[127]  arXiv:2105.07147 (replaced) [pdf, other]
Title: FloorPlanCAD: A Large-Scale CAD Drawing Dataset for Panoptic Symbol Spotting
Comments: v2, 17 pages, 16 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[128]  arXiv:2105.13381 (replaced) [pdf]
Title: Recent advances and clinical applications of deep learning in medical image analysis
Comments: Added content: (1) Transformers in segmentation; (2) Unsupervised anomaly detection; (3) More backgrounds of self-supervised and semi-supervised learning; (4) Figure 1 (taxonomy). Other modifications: (1) Discussion was significantly expanded; (2) Attention mechanisms were introduced as one of the general strategies for performance boost; (3) The introduction of GANs was reduced
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[129]  arXiv:2106.03135 (replaced) [pdf, other]
Title: Go with the Flows: Mixtures of Normalizing Flows for Point Cloud Generation and Reconstruction
Journal-ref: International Conference on 3D Vision 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[130]  arXiv:2106.03987 (replaced) [pdf, other]
Title: Weakly Supervised Volumetric Image Segmentation with Deformed Templates
Comments: 13 Pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[131]  arXiv:2106.04488 (replaced) [pdf, other]
Title: Low-Rank Subspaces in GANs
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[132]  arXiv:2106.13700 (replaced) [pdf, other]
Title: ViTAS: Vision Transformer Architecture Search
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[133]  arXiv:2107.02170 (replaced) [pdf, other]
Title: On Model Calibration for Long-Tailed Object Detection and Instance Segmentation
Comments: Accepted to NeurIPS 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[134]  arXiv:2107.02299 (replaced) [pdf, other]
Title: LightFuse: Lightweight CNN based Dual-exposure Fusion
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[135]  arXiv:2107.07224 (replaced) [pdf, other]
Title: StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN
Comments: Final draft
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[136]  arXiv:2107.10224 (replaced) [pdf, other]
Title: CycleMLP: A MLP-like Architecture for Dense Prediction
Comments: Technical report. Code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[137]  arXiv:2107.14228 (replaced) [pdf, other]
Title: Open-World Entity Segmentation
Comments: Project page: this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[138]  arXiv:2108.01199 (replaced) [pdf, other]
Title: Neural Image Representations for Multi-Image Fusion and Layer Separation
Comments: Project page: this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[139]  arXiv:2108.06810 (replaced) [pdf, other]
Title: SCIDA: Self-Correction Integrated Domain Adaptation from Single- to Multi-label Aerial Images
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[140]  arXiv:2108.10048 (replaced) [pdf, other]
Title: How Transferable Are Self-supervised Features in Medical Image Classification Tasks?
Comments: Accepted to Machine Learning for Health (ML4H) 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[141]  arXiv:2108.13341 (replaced) [pdf, other]
Title: Hire-MLP: Vision MLP via Hierarchical Rearrangement
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[142]  arXiv:2109.05750 (replaced) [pdf, other]
Title: Spatial-Separated Curve Rendering Network for Efficient and High-Resolution Image Harmonization
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[143]  arXiv:2109.07270 (replaced) [pdf, other]
Title: Distract Your Attention: Multi-head Cross Attention Network for Facial Expression Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[144]  arXiv:2109.13925 (replaced) [pdf, other]
Title: Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models
Comments: Accepted at the ML4Physical Sciences Workshop at NeurIPS 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
[145]  arXiv:2110.06915 (replaced) [pdf, other]
Title: Object-Region Video Transformers
Comments: Tech report
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[146]  arXiv:2110.11001 (replaced) [pdf, other]
Title: Pixel-Level Face Image Quality Assessment for Explainable Face Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[147]  arXiv:2110.15327 (replaced) [pdf, other]
Title: MEGAN: Memory Enhanced Graph Attention Network for Space-Time Video Super-Resolution
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[148]  arXiv:2111.01215 (replaced) [pdf, other]
Title: Gradient Frequency Modulation for Visually Explaining Video Understanding Models
Comments: Accepted by BMVC 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[149]  arXiv:2111.03039 (replaced) [pdf, other]
Title: Towards Panoptic 3D Parsing for Single Image in the Wild
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[150]  arXiv:2111.06959 (replaced) [pdf, other]
Title: Through-Foliage Tracking with Airborne Optical Sectioning
Comments: 9 Pages, 9 Figures, 1 Table and supplementary videos and material
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[151]  arXiv:2111.08974 (replaced) [pdf, other]
Title: Pedestrian Detection by Exemplar-Guided Contrastive Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[152]  arXiv:2111.10007 (replaced) [pdf, other]
Title: FBNetV5: Neural Architecture Search for Multiple Tasks in One Run
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[153]  arXiv:2111.10969 (replaced) [pdf, other]
Title: Medical Aegis: Robust adversarial protectors for medical images
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[154]  arXiv:2111.12126 (replaced) [pdf, other]
Title: Panoptic Segmentation Meets Remote Sensing
Comments: 40 pages, 10 figures, submitted to journal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)
[155]  arXiv:2111.12221 (replaced) [pdf]
Title: Source-free unsupervised domain adaptation for cross-modality abdominal multi-organ segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[156]  arXiv:2111.13295 (replaced) [pdf, other]
Title: Medial Spectral Coordinates for 3D Shape Analysis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[157]  arXiv:2111.13475 (replaced) [pdf, other]
Title: QMagFace: Simple and Accurate Quality-Aware Face Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[158]  arXiv:2111.13495 (replaced) [pdf, other]
Title: In-painting Radiography Images for Unsupervised Anomaly Detection
Comments: Main paper with appendix
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[159]  arXiv:2111.14075 (replaced) [pdf]
Title: Image preprocessing and modified adaptive thresholding for improving OCR
Comments: 5 pages, 7 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[160]  arXiv:2111.14547 (replaced) [pdf, other]
Title: LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering
Comments: 11 pages, 5 figures, Code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[161]  arXiv:2111.14562 (replaced) [pdf, other]
Title: Instance-wise Occlusion and Depth Orders in Natural Scenes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[162]  arXiv:2111.14605 (replaced) [pdf, other]
Title: Weakly-supervised Generative Adversarial Networks for medical image classification
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[163]  arXiv:2111.14672 (replaced) [pdf, other]
Title: Human Performance Capture from Monocular Video in the Wild
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[164]  arXiv:2111.14799 (replaced) [pdf, other]
Title: UBoCo : Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[165]  arXiv:1810.02244 (replaced) [pdf, other]
Title: Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks
Comments: Extended version with proofs, accepted at AAAI 2019, added units of measurement of QM9 dataset into appendix, removed results from Wu et al., 2018 due to different units
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
[166]  arXiv:2012.09831 (replaced) [pdf, other]
Title: On Episodes, Prototypical Networks, and Few-shot Learning
Comments: 18 pages. To appear at NeurIPS 2021. A preliminary version of this work appeared as an oral presentation at the NeurIPS 2020 meta-learning workshop
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[167]  arXiv:2106.08601 (replaced) [pdf, other]
Title: Self-Supervised GANs with Label Augmentation
Comments: Accepted at NeurIPS 2021
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[168]  arXiv:2107.11472 (replaced) [pdf, other]
Title: Clipped Hyperbolic Classifiers Are Super-Hyperbolic Classifiers
Comments: 18 pages, 9 figures
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[169]  arXiv:2107.13576 (replaced) [pdf, other]
Title: Social Processes: Self-Supervised Meta-Learning over Conversational Groups for Forecasting Nonverbal Social Cues
Comments: 12 pages, 8 pages Appendices, 10 figures, 8 tables
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[170]  arXiv:2111.03047 (replaced) [pdf, other]
Title: A deep ensemble approach to X-ray polarimetry
Comments: Fourth Workshop on Machine Learning and the Physical Sciences (NeurIPS 2021)
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[171]  arXiv:2111.04742 (replaced) [pdf, other]
Title: E(2) Equivariant Self-Attention for Radio Astronomy
Comments: Accepted in: Fourth Workshop on Machine Learning and the Physical Sciences (35th Conference on Neural Information Processing Systems; NeurIPS2021); final version; 7 pages, 3 figures
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
[172]  arXiv:2111.05978 (replaced) [pdf, other]
Title: Trustworthy Medical Segmentation with Uncertainty Estimation
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[173]  arXiv:2111.06206 (replaced) [pdf, other]
Title: Towards Axiomatic, Hierarchical, and Symbolic Explanation for Deep Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[174]  arXiv:2111.10991 (replaced) [pdf, other]
Title: Backdoor Attack through Frequency Domain
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[175]  arXiv:2111.14693 (replaced) [pdf, other]
Title: SAGCI-System: Towards Sample-Efficient, Generalizable, Compositional, and Incremental Robot Learning
Comments: Submitted to IEEE International Conference on Robotics and Automation (ICRA) 2022
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)