
Computer Vision and Pattern Recognition

New submissions


New submissions for Thu, 26 Jan 23

[1]  arXiv:2301.10293 [pdf]
Title: A Fast Feature Point Matching Algorithm Based on IMU Sensor
Authors: Lu Cao
Comments: 6 pages, 4 figures, 2 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In simultaneous localization and mapping (SLAM), the image feature point matching process consumes a lot of time. The capacity of low-power platforms such as embedded systems is limited, so it is difficult to ensure that each image is processed in time. To reduce the time spent matching feature points in SLAM, an algorithm that uses an inertial measurement unit (IMU) to improve the efficiency of image feature point matching is proposed. When matching the feature points of two images, the presented algorithm does not need to traverse the whole image; it searches only a small range around the predicted point to find the matching feature point. Compared with the traditional algorithm, the experimental results show that this method greatly reduces the time consumed by image feature point matching. These conclusions will help research on how to use the IMU to optimize the efficiency of image feature point matching and improve real-time performance in SLAM.
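The windowed search described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: it assumes the IMU-based prediction has already produced a pixel location, that descriptors are binary and packed into Python integers, and that Hamming distance is the matching score. The names `match_near_prediction`, `radius` and `max_distance` are hypothetical.

```python
from typing import List, Optional, Tuple

# (x, y, descriptor) where the descriptor is a binary string packed into an int.
Keypoint = Tuple[float, float, int]


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed binary descriptors."""
    return bin(a ^ b).count("1")


def match_near_prediction(
    predicted_xy: Tuple[float, float],
    descriptor: int,
    candidates: List[Keypoint],
    radius: float = 20.0,      # assumed search radius in pixels
    max_distance: int = 64,    # assumed acceptance threshold
) -> Optional[int]:
    """Return the index of the best candidate inside the search window, or None.

    Only keypoints within `radius` of the predicted location are compared,
    instead of every keypoint in the image.
    """
    px, py = predicted_xy
    best_index: Optional[int] = None
    best_distance = max_distance + 1
    for i, (x, y, cand_desc) in enumerate(candidates):
        if (x - px) ** 2 + (y - py) ** 2 > radius ** 2:
            continue  # outside the window around the predicted point
        dist = hamming(descriptor, cand_desc)
        if dist < best_distance:
            best_index, best_distance = i, dist
    return best_index
```

The saving comes from the early `continue`: descriptor comparisons are made only for keypoints near the prediction, so the cost scales with the window size rather than with the total number of keypoints.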

[2]  arXiv:2301.10295 [pdf, other]
Title: Object Segmentation with Audio Context
Comments: Research project for Introduction to Deep Learning (11785) at Carnegie Mellon University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Visual objects often have acoustic signatures that are naturally synchronized with them in audio-bearing video recordings. For this project, we explore multimodal feature aggregation for the video instance segmentation task, in which we integrate audio features into our video segmentation model to conduct an audio-visual learning scheme. Our method is based on an existing video instance segmentation method which leverages rich contextual information across video frames. Since this is the first attempt to investigate audio-visual instance segmentation, a novel dataset, including 20 vocal classes with synchronized video and audio recordings, is collected. By utilizing a combined decoder to fuse both video and audio features, our model shows a slight improvement compared to the base model. Additionally, we managed to show the effectiveness of different modules by conducting extensive ablations.

[3]  arXiv:2301.10351 [pdf, other]
Title: Few-Shot Learning Enables Population-Scale Analysis of Leaf Traits in Populus trichocarpa
Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Plant phenotyping is typically a time-consuming and expensive endeavor, requiring large groups of researchers to meticulously measure biologically relevant plant traits, and is the main bottleneck in understanding plant adaptation and the genetic architecture underlying complex traits at population scale. In this work, we address these challenges by leveraging few-shot learning with convolutional neural networks (CNNs) to segment the leaf body and visible venation of 2,906 P. trichocarpa leaf images obtained in the field. In contrast to previous methods, our approach (i) does not require experimental or image pre-processing, (ii) uses the raw RGB images at full resolution, and (iii) requires very few samples for training (e.g., just eight images for vein segmentation). Traits relating to leaf morphology and vein topology are extracted from the resulting segmentations using traditional open-source image-processing tools, validated using real-world physical measurements, and used to conduct a genome-wide association study to identify genes controlling the traits. In this way, the current work is designed to provide the plant phenotyping community with (i) methods for fast and accurate image-based feature extraction that require minimal training data, and (ii) a new population-scale data set, including 68 different leaf phenotypes, for domain scientists and machine learning researchers. All of the few-shot learning code, data, and results are made publicly available.

[4]  arXiv:2301.10413 [pdf, other]
Title: Local Feature Extraction from Salient Regions by Feature Map Transformation
Comments: British Machine Vision Conference (BMVC) 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Local feature matching is essential for many applications, such as localization and 3D reconstruction. However, it is challenging to match feature points accurately across varying camera viewpoints and illumination conditions. In this paper, we propose a framework that robustly extracts and describes salient local features regardless of changing light and viewpoints. The framework suppresses illumination variations and encourages structural information to ignore the noise from light and to focus on edges. We classify the elements in the feature covariance matrix, an implicit form of feature map information, into two components. Our model extracts feature points from salient regions, leading to fewer incorrect matches. In our experiments, the proposed method achieved higher accuracy than the state-of-the-art methods on public datasets such as HPatches, Aachen Day-Night, and ETH, which exhibit highly variable viewpoints and illumination.

[5]  arXiv:2301.10431 [pdf, other]
Title: Bias-Compensated Integral Regression for Human Pose Estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In human and hand pose estimation, heatmaps are a crucial intermediate representation for a body or hand keypoint. Two popular methods to decode the heatmap into a final joint coordinate are via an argmax, as done in heatmap detection, or via softmax and expectation, as done in integral regression. Integral regression is learnable end-to-end, but has lower accuracy than detection. This paper uncovers an induced bias from integral regression that results from combining the softmax and the expectation operation. This bias often forces the network to learn degenerately localized heatmaps, obscuring the keypoint's true underlying distribution and leading to lower accuracy. Training-wise, by investigating the gradients of integral regression, we show that the implicit guidance of integral regression to update the heatmap makes it slower to converge than detection. To counter the above two limitations, we propose Bias Compensated Integral Regression (BCIR), an integral regression-based framework that compensates for the bias. BCIR also incorporates a Gaussian prior loss to speed up training and improve prediction accuracy. Experimental results on both the human body and hand benchmarks show that BCIR is faster to train and more accurate than the original integral regression, making it competitive with state-of-the-art detection methods.
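The two decoding schemes named in the abstract above, and the bias that arises from combining softmax with expectation, can be illustrated on a one-dimensional heatmap. This is a hedged sketch of plain argmax and soft-argmax decoding only; it does not implement BCIR, and the temperature parameter `beta` and the function names are assumptions introduced for illustration.

```python
import math
from typing import List


def decode_argmax(heatmap: List[float]) -> int:
    """Detection-style decoding: index of the largest heatmap value."""
    return max(range(len(heatmap)), key=lambda i: heatmap[i])


def decode_soft_argmax(heatmap: List[float], beta: float = 1.0) -> float:
    """Integral-regression-style decoding: softmax, then expected index.

    `beta` is an assumed temperature; larger values sharpen the softmax.
    """
    peak = max(heatmap)  # subtract the max for numerical stability
    weights = [math.exp(beta * (h - peak)) for h in heatmap]
    total = sum(weights)
    return sum(i * w for i, w in enumerate(weights)) / total


# A Gaussian-shaped heatmap peaked at index 2 on an 11-pixel grid.
heatmap = [math.exp(-((i - 2) ** 2) / 2.0) for i in range(11)]
```

On this heatmap the argmax recovers index 2, while the soft-argmax with `beta = 1` lands between 2 and the grid center at 5, because the softmax assigns non-zero mass to every near-zero background pixel. A network trained through this decoder can only cancel that pull by producing a much sharper heatmap, which is consistent with the degenerately localized heatmaps the abstract describes.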

[6]  arXiv:2301.10441 [pdf, other]
Title: Learning Trustworthy Model from Noisy Labels based on Rough Set for Surface Defect Detection
Comments: 12 pages, 8 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

In surface defect detection, there are some suspicious regions that cannot be uniquely classified as abnormal or normal. The annotation of suspicious regions is easily affected by factors such as workers' emotional fluctuations and judgment standards, resulting in noisy labels, which in turn lead to missed and false detections, and ultimately to inconsistent judgments of product quality. Unlike the usual noisy labels, the ones used for surface defect detection appear to be inconsistent rather than mislabeled. The noise occurs in almost every label and is difficult to correct or evaluate. In this paper, we propose a framework that learns trustworthy models from noisy labels for surface defect detection. First, to avoid the negative impact of noisy labels on the model, we represent the suspicious regions with consistent and precise elements at the pixel level and redesign the loss function. Second, without changing the network structure or adding any extra labels, a pluggable spatially correlated Bayesian module is proposed. Finally, the defect discrimination confidence is proposed to measure the uncertainty, with which anomalies can be identified as defects. Our results indicate not only the effectiveness of the proposed method in learning from noisy labels, but also its robustness and real-time performance.

[7]  arXiv:2301.10460 [pdf, other]
Title: HAL3D: Hierarchical Active Learning for Fine-Grained 3D Part Labeling
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We present the first active learning tool for fine-grained 3D part labeling, a problem which challenges even the most advanced deep learning (DL) methods due to the significant structural variations among the small and intricate parts. For the same reason, the necessary data annotation effort is tremendous, motivating approaches to minimize human involvement. Our labeling tool iteratively verifies or modifies part labels predicted by a deep neural network, with human feedback continually improving the network prediction. To effectively reduce human efforts, we develop two novel features in our tool, hierarchical and symmetry-aware active labeling. Our human-in-the-loop approach, coined HAL3D, achieves 100% accuracy (barring human errors) on any test set with pre-defined hierarchical part labels, with 80% time-saving over manual effort.

[8]  arXiv:2301.10473 [pdf, other]
Title: Aircraft Skin Inspections: Towards a New Model for Dent Evaluation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The aircraft maintenance, repair and overhaul (MRO) industry is gradually switching to 3D scanning for dent inspection. High-accuracy devices allow quick and repeatable measurements, which translate into efficient reporting and more objective damage evaluations. However, the potential of 3D scanners is far from being exploited. This is due to the traditional way in which the structural repair manual (SRM) deals with dents, that is, considering length, width and depth as the only relevant measures. Being equivalent to describing a dent as a "box", the current approach discards any information about the actual shape. This causes high degrees of ambiguity, with very different shapes (and corresponding fatigue life) being classified as the same, and nullifies the effort of acquiring such a great amount of information from high-accuracy 3D scanners. In this paper a $7$-parameter model is proposed to describe the actual dent shape, thus enabling the exploitation of the high-fidelity data produced by 3D scanners. The compact set of values can then be compared against historical data and structural evaluations based on the same model.

[9]  arXiv:2301.10492 [pdf, other]
Title: Flow-guided Semi-supervised Video Object Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We propose an optical flow-guided approach for semi-supervised video object segmentation. Optical flow is usually exploited as additional guidance information in unsupervised video object segmentation. However, its relevance in semi-supervised video object segmentation has not been fully explored. In this work, we follow an encoder-decoder approach to address the segmentation task. A model to extract the combined information from optical flow and the image is proposed, which is then used as input to the target model and the decoder network. Unlike previous methods where concatenation is used to integrate information from image data and optical flow, a simple yet effective attention mechanism is exploited in our work. Experiments on DAVIS 2017 and YouTube-VOS 2019 show that integrating the information extracted from optical flow into the original image branch results in a strong performance gain, and our method achieves state-of-the-art performance.

[10]  arXiv:2301.10531 [pdf, other]
Title: 3D Tooth Mesh Segmentation with Simplified Mesh Cell Representation
Comments: accepted at IEEE ISBI 2023 International Symposium on Biomedical Imaging
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Manual tooth segmentation of 3D tooth meshes is tedious and there are variations among dentists. Several deep learning based methods have been proposed to perform automatic tooth mesh segmentation. Many of the proposed tooth mesh segmentation algorithms summarize the mesh cell as the cell center or barycenter, the normal at the barycenter, the cell vertices and the normals at the cell vertices. Summarizing the mesh cell/triangle in this manner imposes an implicit structural constraint and makes it difficult to work with multiple resolutions, which is done in many point cloud based deep learning algorithms. We propose a novel segmentation method which utilizes only the barycenter and the normal at the barycenter of the mesh cell and yet achieves competitive performance. We are the first to demonstrate that it is possible to relax the implicit structural constraint and yet achieve superior segmentation performance.

[11]  arXiv:2301.10540 [pdf, other]
Title: Modelling Long Range Dependencies in N-D: From Task-Specific to a General Purpose CNN
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Performant Convolutional Neural Network (CNN) architectures must be tailored to specific tasks in order to consider the length, resolution, and dimensionality of the input data. In this work, we tackle the need for problem-specific CNN architectures. We present the Continuous Convolutional Neural Network (CCNN): a single CNN able to process data of arbitrary resolution, dimensionality and length without any structural changes. Its key components are its continuous convolutional kernels, which model long-range dependencies at every layer and thus remove the need of current CNN architectures for task-dependent downsampling and depths. We showcase the generality of our method by using the same architecture for tasks on sequential ($1{\rm D}$), visual ($2{\rm D}$) and point-cloud ($3{\rm D}$) data. Our CCNN matches and often outperforms the current state-of-the-art across all tasks considered.

[12]  arXiv:2301.10551 [pdf, other]
Title: Variation-Aware Semantic Image Synthesis
Comments: 12 pages, 3 figures, 5 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Semantic image synthesis (SIS) aims to produce photorealistic images aligned to a given conditional semantic layout and has witnessed a significant improvement in recent years. Although image-level diversity has been discussed heavily, class-level mode collapse widely exists in current algorithms. Therefore, we declare a new requirement for SIS to achieve more photorealistic images: variation-awareness, which consists of inter- and intra-class variation. The inter-class variation is the diversity between different semantic classes while the intra-class variation stresses the diversity inside one class. Through analysis, we find that current algorithms elusively embrace the inter-class variation but the intra-class variation is still not enough. Further, we introduce two simple methods to achieve variation-aware semantic image synthesis (VASIS) with a higher intra-class variation: semantic noise and position code. We combine our method with several state-of-the-art algorithms, and the experimental results show that our models generate more natural images and achieve slightly better FIDs and/or mIoUs than their counterparts. Our codes and models will be publicly available.

[13]  arXiv:2301.10559 [pdf, other]
Title: Tracking Different Ant Species: An Unsupervised Domain Adaptation Framework and a Dataset for Multi-object Tracking
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Tracking individuals is a vital part of many experiments conducted to understand collective behaviour. Ants are the paradigmatic model system for such experiments but their lack of individually distinguishing visual features and their high colony densities make it extremely difficult to perform reliable tracking automatically. Additionally, the wide diversity of their species' appearances makes a generalized approach even harder. In this paper, we propose a data-driven multi-object tracker that, for the first time, employs domain adaptation to achieve the required generalisation. This approach is built upon a joint-detection-and-tracking framework that is extended by a set of domain discriminator modules integrating an adversarial training strategy in addition to the tracking loss. In addition to this novel domain-adaptive tracking framework, we present a new dataset and a benchmark for the ant tracking problem. The dataset contains 57 video sequences with full trajectory annotation, including 30k frames captured from two different ant species moving on different background patterns. It comprises 33 and 24 sequences for source and target domains, respectively. We compare our proposed framework against other domain-adaptive and non-domain-adaptive multi-object tracking baselines using this dataset and show that incorporating domain adaptation at multiple levels of the tracking pipeline yields significant improvements. The code and the dataset are available at https://github.com/chamathabeysinghe/da-tracker.

[14]  arXiv:2301.10575 [pdf, other]
Title: Trainable Loss Weights in Super-Resolution
Comments: 7 pages, 3 figures, 1 table
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

In recent years, research on super-resolution has primarily focused on the development of unsupervised models, blind networks, and the use of optimization methods in non-blind models. However, limited research has discussed the loss function in the super-resolution process. The majority of those studies have only used perceptual similarity in a conventional way, even though the development of an appropriate loss can improve the quality of other methods as well. In this article, a new weighting method for pixel-wise loss is proposed. With the help of this method, it is possible to use trainable weights based on the general structure of the image and its perceptual features while maintaining the advantages of pixel-wise loss. Also, a criterion for comparing weights of loss is introduced so that the weights can be estimated directly by a convolutional neural network using this criterion. In addition, the expectation-maximization method is used for the simultaneous estimation of the super-resolution network and the weighting network. Furthermore, a new activation function, called "FixedSum", is introduced which can keep the sum of all components of the vector constant while keeping the output components between zero and one. As shown in the experimental results section, weighted loss by the proposed method leads to better results than the unweighted loss in both signal-to-noise and perceptual similarity senses.

[15]  arXiv:2301.10583 [pdf, other]
Title: An Efficient Approximate Method for Online Convolutional Dictionary Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Most existing convolutional dictionary learning (CDL) algorithms are based on batch learning, where the dictionary filters and the convolutional sparse representations are optimized in an alternating manner using a training dataset. When large training datasets are used, batch CDL algorithms become prohibitively memory-intensive. An online-learning technique is used to reduce the memory requirements of CDL by optimizing the dictionary incrementally after finding the sparse representations of each training sample. Nevertheless, learning large dictionaries using the existing online CDL (OCDL) algorithms remains highly computationally expensive. In this paper, we present a novel approximate OCDL method that incorporates sparse decomposition of the training samples. The resulting optimization problems are addressed using the alternating direction method of multipliers. Extensive experimental evaluations using several image datasets show that the proposed method substantially reduces computational costs while preserving the effectiveness of the state-of-the-art OCDL algorithms.

[16]  arXiv:2301.10584 [pdf, other]
Title: A Method For Eliminating Contour Errors In Self-Encoder Reconstructed Images
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

In this paper, we propose a self-supervised twin network approach based on this a priori. The method generates the approximate edge information of an image and then differentially eliminates the edge errors in the reconstructed image with a dilate algorithm. This is used to improve the accuracy of the reconstructed image and to separate foreign matter and noise from the original image, so that it can be visualized in a more practical scene.

[17]  arXiv:2301.10593 [pdf, other]
Title: Faster DAN: Multi-target Queries with Document Positional Encoding for End-to-end Handwritten Document Recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent advances in handwritten text recognition have made it possible to recognize whole documents in an end-to-end way: the Document Attention Network (DAN) recognizes the characters one after the other through an attention-based prediction process until reaching the end of the document. However, this autoregressive process leads to inference that cannot benefit from any parallelization optimization. In this paper, we propose Faster DAN, a two-step strategy to speed up the recognition process at prediction time: the model predicts the first character of each text line in the document, and then completes all the text lines in parallel through multi-target queries and a specific document positional encoding scheme. Faster DAN reaches competitive results compared to standard DAN, while being at least 4 times faster on whole single-page and double-page images of the RIMES 2009, READ 2016 and MAURDOR datasets. Source code and trained model weights are available at https://github.com/FactoDeepLearning/FasterDAN.

[18]  arXiv:2301.10608 [pdf, other]
Title: Connecting metrics for shape-texture knowledge in computer vision
Comments: 7 pages, 3 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Modern artificial neural networks, including convolutional neural networks and vision transformers, have mastered several computer vision tasks, including object recognition. However, there are many significant differences between the behavior and robustness of these systems and of the human visual system. Deep neural networks remain brittle and susceptible to many changes in the image that do not cause humans to misclassify images. Part of this different behavior may be explained by the type of features humans and deep neural networks use in vision tasks. Humans tend to classify objects according to their shape while deep neural networks seem to rely mostly on texture. Exploring this question is relevant, since it may lead to better performing neural network architectures and to a better understanding of the workings of the vision system of primates. In this work, we advance the state of the art in our understanding of this phenomenon by extending previous analyses to a much larger set of deep neural network architectures. We found that the performance of models in image classification tasks is highly correlated with their shape bias measured at the output and penultimate layer. Furthermore, our results showed that the numbers of neurons that represent shape and texture are strongly anti-correlated, thus providing evidence that there is competition between these two types of features. Finally, we observed that while in general there is a correlation between performance and shape bias, there are significant variations between architecture families.

[19]  arXiv:2301.10611 [pdf, other]
Title: Discriminator-free Unsupervised Domain Adaptation for Multi-label Image Classification
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, a discriminator-free adversarial-based Unsupervised Domain Adaptation (UDA) method for Multi-Label Image Classification (MLIC), referred to as DDA-MLIC, is proposed. Over the last two years, some attempts have been made to introduce adversarial-based UDA methods in the context of MLIC. However, these methods, which rely on an additional discriminator subnet, present two shortcomings. First, the learning of domain-invariant features may harm their task-specific discriminative power, since the classification and discrimination tasks are decoupled. Moreover, the use of an additional discriminator usually induces an increase in the network size. Herein, we propose to overcome these issues by introducing a novel adversarial critic that is directly deduced from the task-specific classifier. Specifically, a two-component Gaussian Mixture Model (GMM) is fitted on the source and target predictions, allowing the distinction of two clusters. This allows extracting a Gaussian distribution for each component. The resulting Gaussian distributions are then used for formulating an adversarial loss based on a Fréchet distance. The proposed method is evaluated on three multi-label image datasets. The obtained results demonstrate that DDA-MLIC outperforms existing state-of-the-art methods while requiring a lower number of parameters.
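For reference, the Fréchet distance between two Gaussians has a simple closed form in the univariate case. The sketch below assumes one-dimensional Gaussians described by a mean and a standard deviation; how DDA-MLIC pairs the GMM components or assembles its adversarial loss is not stated in the abstract, so nothing here should be read as the authors' loss. The function name is hypothetical.

```python
import math


def frechet_distance_1d(mu1: float, sigma1: float, mu2: float, sigma2: float) -> float:
    """Fréchet (2-Wasserstein) distance between N(mu1, sigma1^2) and N(mu2, sigma2^2).

    For univariate Gaussians the squared distance is
    (mu1 - mu2)^2 + (sigma1 - sigma2)^2.
    """
    return math.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)
```

The distance is zero only when both the means and the standard deviations coincide, so it penalizes a shift between two prediction clusters as well as a difference in their spread.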

[20]  arXiv:2301.10625 [pdf, other]
Title: Toward Realistic Evaluation of Deep Active Learning Algorithms in Image Classification
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Active Learning (AL) aims to reduce the labeling burden by interactively querying the most informative observations from a data pool. Despite extensive research on improving AL query methods in the past years, recent studies have questioned the advantages of AL, especially in the light of emerging alternative training paradigms such as semi-supervised (Semi-SL) and self-supervised learning (Self-SL). Thus, today's AL literature paints an inconsistent picture and leaves practitioners wondering whether and how to employ AL in their tasks. We argue that this heterogeneous landscape is caused by a lack of a systematic and realistic evaluation of AL algorithms, including key parameters such as complex and imbalanced datasets, realistic labeling scenarios, systematic method configuration, and integration of Semi-SL and Self-SL. To this end, we present an AL benchmarking suite and run extensive experiments on five datasets shedding light on the questions: when and how to apply AL?
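As a concrete instance of "querying the most informative observations", one common query rule is entropy-based uncertainty sampling. The sketch below is only an illustration of that generic rule, assuming the model's class probabilities for each unlabeled pool item are already available; it is not taken from the benchmarking suite in the abstract, and the names `select_queries` and `budget` are hypothetical.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_queries(pool_probs: List[Sequence[float]], budget: int) -> List[int]:
    """Return indices of the `budget` pool items with the most uncertain predictions."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,  # highest entropy first
    )
    return ranked[:budget]
```

The selected items would then be labeled and added to the training set before the model is retrained and the pool is scored again.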

[21]  arXiv:2301.10670 [pdf, other]
Title: Towards Arbitrary Text-driven Image Manipulation via Space Alignment
Comments: 8 pages, 12 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent GAN inversion methods have been able to successfully invert a real input image to the corresponding editable latent code in StyleGAN. By combining them with the language-vision model CLIP, some text-driven image manipulation methods have been proposed. However, these methods require extra costs to perform optimization for a certain image or a new attribute editing mode. To achieve a more efficient editing method, we propose a new Text-driven image Manipulation framework via Space Alignment (TMSA). The Space Alignment module aims to align the same semantic regions in the CLIP and StyleGAN spaces. Then, the text input can be directly mapped into the StyleGAN space and used to find the semantic shift according to the text description. The framework can support arbitrary image editing modes without additional cost. Our work provides the user with an interface to control the attributes of a given image according to text input and get the result in real time. Extensive experiments demonstrate our superior performance over prior works.

[22]  arXiv:2301.10732 [pdf, other]
Title: An Efficient Semi-Automated Scheme for Infrastructure LiDAR Annotation
Comments: Submitted to IEEE Intelligent Transportation Systems Transactions
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Most existing perception systems rely on sensory data acquired from cameras, which perform poorly in low light and adverse weather conditions. To resolve this limitation, we have witnessed advanced LiDAR sensors become popular in perception tasks in autonomous driving applications. Nevertheless, their usage in traffic monitoring systems is less ubiquitous. We identify two significant obstacles in cost-effectively and efficiently developing such a LiDAR-based traffic monitoring system: (i) public LiDAR datasets are insufficient for supporting perception tasks in infrastructure systems, and (ii) 3D annotations on LiDAR point clouds are time-consuming and expensive. To fill this gap, we present an efficient semi-automated annotation tool that automatically annotates LiDAR sequences with tracking algorithms while offering a fully annotated infrastructure LiDAR dataset -- FLORIDA (Florida LiDAR-based Object Recognition and Intelligent Data Annotation) -- which will be made publicly available. Our advanced annotation tool seamlessly integrates multi-object tracking (MOT), single-object tracking (SOT), and suitable trajectory post-processing techniques. Specifically, we introduce a human-in-the-loop schema in which annotators recursively fix and refine annotations imperfectly predicted by our tool and incrementally add them to the training dataset to obtain better SOT and MOT models. By repeating the process, we significantly increase the overall annotation speed by three to four times and obtain better qualitative annotations than a state-of-the-art annotation tool. The human annotation experiments verify the effectiveness of our annotation tool. In addition, we provide detailed statistics and object detection evaluation results for our dataset in serving as a benchmark for perception tasks at traffic intersections.

[23]  arXiv:2301.10750 [pdf, ps, other]
Title: Out of Distribution Performance of State of Art Vision Model
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The vision transformer (ViT) has advanced to the cutting edge in the visual recognition task. Transformers are more robust than CNNs, according to the latest research. ViT's self-attention mechanism, according to the claim, makes it more robust than CNNs. Even so, we find that these conclusions are based on unfair experimental conditions and on comparing just a few models, which does not allow the entire picture of robustness performance to be depicted. In this study, we investigate the performance of 58 state-of-the-art computer vision models in a unified training setup based not only on attention and convolution mechanisms but also on neural networks based on a combination of convolution and attention mechanisms, sequence-based model, complementary search, and network-based method. Our research demonstrates that robustness depends on the training setup and model types, and performance varies based on out-of-distribution type. Our research will aid the community in better understanding and benchmarking the robustness of computer vision models.

[24]  arXiv:2301.10759 [pdf, other]
Title: Efficient Flow-Guided Multi-frame De-fencing
Comments: 16 pages, 12 figures. Published at the Winter Conference on Application of Computer Vision (WACV) 2023
Journal-ref: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023, pp. 1838-1847
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Taking photographs "in-the-wild" is often hindered by fence obstructions that stand between the camera user and the scene of interest, and which are hard or impossible to avoid. De-fencing is the algorithmic process of automatically removing such obstructions from images, revealing the invisible parts of the scene. While this problem can be formulated as a combination of fence segmentation and image inpainting, this often leads to implausible hallucinations of the occluded regions. Existing multi-frame approaches rely on propagating information to a selected keyframe from its temporal neighbors, but they are often inefficient and struggle with alignment of severely obstructed images. In this work we draw inspiration from the video completion literature and develop a simplified framework for multi-frame de-fencing that computes high quality flow maps directly from obstructed frames and uses them to accurately align frames. Our primary focus is efficiency and practicality in a real-world setting: the input to our algorithm is a short image burst (5 frames), a data modality commonly available in modern smartphones, and the output is a single reconstructed keyframe, with the fence removed. Our approach leverages simple yet effective CNN modules, trained on carefully generated synthetic data, and outperforms more complicated alternatives on real bursts, both quantitatively and qualitatively, while running in real time.

[25]  arXiv:2301.10766 [pdf, other]
Title: On the Adversarial Robustness of Camera-based 3D Object Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In recent years, camera-based 3D object detection has gained widespread attention for its ability to achieve high performance with low computational cost. However, the robustness of these methods to adversarial attacks has not been thoroughly examined. In this study, we conduct the first comprehensive investigation of the robustness of leading camera-based 3D object detection methods under various adversarial conditions. Our experiments reveal five interesting findings: (a) the use of accurate depth estimation effectively improves robustness; (b) depth-estimation-free approaches do not show superior robustness; (c) bird's-eye-view-based representations exhibit greater robustness against localization attacks; (d) incorporating multi-frame benign inputs can effectively mitigate adversarial attacks; and (e) addressing long-tail problems can enhance robustness. We hope our work can provide guidance for the design of future camera-based object detection modules with improved adversarial robustness.

Cross-lists for Thu, 26 Jan 23

[26]  arXiv:2108.13161 (cross-list from cs.CL) [pdf, other]
Title: Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners
Comments: Accepted by ICLR 2022
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Large-scale pre-trained language models have contributed significantly to natural language processing by demonstrating remarkable abilities as few-shot learners. However, their effectiveness depends mainly on scaling the model parameters and prompt design, hindering their implementation in most real-world applications. This study proposes a novel pluggable, extensible, and efficient approach named DifferentiAble pRompT (DART), which can convert small language models into better few-shot learners without any prompt engineering. The main principle behind this approach involves reformulating potential natural language processing tasks into the task of a pre-trained language model and differentially optimizing the prompt template as well as the target label with backpropagation. Furthermore, the proposed approach can be: (i) plugged into any pre-trained language model; (ii) extended to widespread classification tasks. A comprehensive evaluation on standard NLP tasks demonstrates that the proposed approach achieves better few-shot performance. Code is available at https://github.com/zjunlp/DART.

[27]  arXiv:2301.10327 (cross-list from cs.LG) [pdf, other]
Title: Generating Multidimensional Clusters With Support Lines
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Programming Languages (cs.PL)

Synthetic data is essential for assessing clustering techniques, complementing and extending real data, and allowing for a more complete coverage of a given problem's space. In turn, synthetic data generators have the potential of creating vast amounts of data -- a crucial activity when real-world data is at a premium -- while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present Clugen, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. Clugen is open source, 100% unit tested and fully documented, and is available for the Python, R, Julia and MATLAB/Octave ecosystems. We demonstrate that our proposal is able to produce rich and varied results in various dimensions, is fit for use in the assessment of clustering algorithms, and has the potential to be a widely used framework in diverse clustering-related research tasks.
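
To make the idea of a cluster "supported by a line segment" concrete, here is a minimal 2D sketch in plain Python. It is not Clugen's API: the function name `line_supported_cluster`, its parameters, and the choice of a uniform distribution along the segment with Gaussian lateral offsets are all assumptions made for illustration.

```python
import math
import random


def line_supported_cluster(center, angle, length, lateral_std, n_points, rng):
    """Sample 2D points scattered around a line segment (hypothetical helper).

    Each point is placed at a uniformly drawn position along the segment,
    then pushed perpendicular to the segment by a Gaussian offset.
    """
    cx, cy = center
    dx, dy = math.cos(angle), math.sin(angle)  # unit vector along the segment
    nx, ny = -dy, dx                           # unit vector normal to the segment
    points = []
    for _ in range(n_points):
        t = rng.uniform(-length / 2, length / 2)  # position along the segment
        s = rng.gauss(0.0, lateral_std)           # lateral offset from the segment
        points.append((cx + t * dx + s * nx, cy + t * dy + s * ny))
    return points


# Illustrative values only: a 200-point cluster around a 45-degree segment.
rng = random.Random(0)
cluster = line_supported_cluster((0.0, 0.0), math.pi / 4, 10.0, 0.5, 200, rng)
```

Swapping `rng.uniform` or `rng.gauss` for other samplers changes the cluster's shape along or across the segment, which is one way to read the abstract's "arbitrary distributions".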

[28]  arXiv:2301.10365 (cross-list from eess.IV) [pdf, other]
Title: Data Consistent Deep Rigid MRI Motion Correction
Comments: 13 pages, 5 figures, motion correction, magnetic resonance imaging, deep learning
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Motion artifacts are a pervasive problem in MRI, leading to misdiagnosis or mischaracterization in population-level imaging studies. Current retrospective rigid intra-slice motion correction techniques jointly optimize estimates of the image and the motion parameters. In this paper, we use a deep network to reduce the joint image-motion parameter search to a search over rigid motion parameters alone. Our network produces a reconstruction as a function of two inputs: corrupted k-space data and motion parameters. We train the network using simulated, motion-corrupted k-space data generated from known motion parameters. At test-time, we estimate unknown motion parameters by minimizing a data consistency loss between the motion parameters, the network-based image reconstruction given those parameters, and the acquired measurements. Intra-slice motion correction experiments on simulated and realistic 2D fast spin echo brain MRI achieve high reconstruction fidelity while retaining the benefits of explicit data consistency-based optimization. Our code is publicly available at https://www.github.com/nalinimsingh/neuroMoCo.

[29]  arXiv:2301.10418 (cross-list from cs.LG) [pdf, other]
Title: DEJA VU: Continual Model Generalization For Unseen Domains
Comments: Published as a conference paper at ICLR 2023
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

In real-world applications, deep learning models often run in non-stationary environments where the target data distribution continually shifts over time. There have been numerous domain adaptation (DA) methods in both online and offline modes to improve cross-domain adaptation ability. However, these DA methods typically only provide good performance after a long period of adaptation, and perform poorly on new domains before and during adaptation - in what we call the "Unfamiliar Period", especially when domain shifts happen suddenly and significantly. On the other hand, domain generalization (DG) methods have been proposed to improve the model generalization ability on unadapted domains. However, existing DG works are ineffective for continually changing domains due to severe catastrophic forgetting of learned knowledge. To overcome these limitations of DA and DG in handling the Unfamiliar Period during continual domain shift, we propose RaTP, a framework that focuses on improving models' target domain generalization (TDG) capability, while also achieving effective target domain adaptation (TDA) capability right after training on certain domains and forgetting alleviation (FA) capability on past domains. RaTP includes a training-free data augmentation module to prepare data for TDG, a novel pseudo-labeling mechanism to provide reliable supervision for TDA, and a prototype contrastive alignment algorithm to align different domains for achieving TDG, TDA and FA. Extensive experiments on Digits, PACS, and DomainNet demonstrate that RaTP significantly outperforms state-of-the-art works from Continual DA, Source-Free DA, Test-Time/Online DA, Single DG, Multiple DG and Unified DA&DG in TDG, and achieves comparable TDA and FA capabilities.

[30]  arXiv:2301.10454 (cross-list from cs.LG) [pdf, other]
Title: A Data-Centric Approach for Improving Adversarial Training Through the Lens of Out-of-Distribution Detection
Comments: Accepted to CSICC 2023
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Current machine learning models achieve super-human performance in many real-world applications. Still, they are susceptible to imperceptible adversarial perturbations. The most effective solution for this problem is adversarial training, which trains the model with adversarially perturbed samples instead of the original ones. Various methods have been developed over recent years to improve adversarial training, such as data augmentation or modifying training attacks. In this work, we examine the same problem from a new data-centric perspective. For this purpose, we first demonstrate that the existing model-based methods can be equivalent to applying smaller perturbations or optimization weights to the hard training examples. Using this finding, we propose detecting and removing these hard samples directly from the training procedure rather than applying complicated algorithms to mitigate their effects. For detection, we use maximum softmax probability, an effective method in out-of-distribution detection, since we can consider the hard samples as out-of-distribution samples for the whole data distribution. Our results on the SVHN and CIFAR-10 datasets show the effectiveness of this method in improving adversarial training without adding too much computational cost.
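
As a rough illustration of the detection step described in this abstract, the sketch below scores each sample by its maximum softmax probability (MSP) and keeps only those at or above a threshold. The function names, the threshold of 0.5, and the toy logits are assumptions for illustration; the paper's actual scoring model and cut-off are not given in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_softmax_probability(logits):
    """Confidence score: the largest class probability."""
    return max(softmax(logits))


def keep_confident_samples(batch_logits, threshold=0.5):
    """Return indices of samples whose MSP is at least `threshold`.

    Samples below the threshold are treated as "hard" (out-of-distribution-like)
    and would be dropped from training under the assumed scheme.
    """
    return [
        i for i, logits in enumerate(batch_logits)
        if max_softmax_probability(logits) >= threshold
    ]


# Toy logits (hypothetical): samples 0 and 2 are confident, sample 1 is not.
toy_logits = [[4.0, 0.0, 0.0], [0.1, 0.0, 0.05], [0.0, 3.0, 0.5]]
kept = keep_confident_samples(toy_logits, threshold=0.5)
```

In practice the logits would come from a trained classifier evaluated on the training set, and the threshold would be tuned rather than fixed.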

[31]  arXiv:2301.10455 (cross-list from eess.IV) [pdf, other]
Title: Rate-Perception Optimized Preprocessing for Video Coding
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

In the past decades, much progress has been made in the video compression field, including traditional video codecs and learning-based video codecs. However, few studies focus on using preprocessing techniques to improve the rate-distortion performance. In this paper, we propose a rate-perception optimized preprocessing (RPP) method. We first introduce an adaptive Discrete Cosine Transform loss function which can save bitrate while keeping essential high-frequency components. Furthermore, we combine several state-of-the-art techniques from low-level vision fields into our approach, such as the high-order degradation model, efficient lightweight network design, and an Image Quality Assessment model. By jointly using these techniques, our RPP approach achieves an average bitrate saving of 16.27% with different video encoders such as AVC, HEVC, and VVC under multiple quality metrics. In the deployment stage, our RPP method is simple and efficient, requiring no changes to the settings of video encoding, streaming, and decoding. Each input frame only needs to make a single pass through RPP before being sent to the video encoder. In addition, in our subjective visual quality test, 87% of users think videos with RPP are better than or equal to videos compressed with the codec alone, while these videos with RPP save about 12% bitrate on average. Our RPP framework has been integrated into the production environment of our video transcoding services, which serve millions of users every day.

[32]  arXiv:2301.10520 (cross-list from eess.IV) [pdf, other]
Title: Ultra-NeRF: Neural Radiance Fields for Ultrasound Imaging
Comments: submitted to MIDL
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

We present a physics-enhanced implicit neural representation (INR) for ultrasound (US) imaging that learns tissue properties from overlapping US sweeps. Our proposed method leverages a ray-tracing-based neural rendering for novel view US synthesis. Recent publications demonstrated that INR models could encode a representation of a three-dimensional scene from a set of two-dimensional US frames. However, these models fail to consider the view-dependent changes in appearance and geometry intrinsic to US imaging. In our work, we discuss direction-dependent changes in the scene and show that a physics-inspired rendering improves the fidelity of US image synthesis. In particular, we demonstrate experimentally that our proposed method generates geometrically accurate B-mode images for regions with ambiguous representation owing to view-dependent differences of the US images. We conduct our experiments using simulated B-mode US sweeps of the liver and acquired US sweeps of a spine phantom tracked with a robotic arm. The experiments corroborate that our method generates US frames that enable consistent volume compounding from previously unseen views. To the best of our knowledge, the presented work is the first to address view-dependent US image synthesis using INR.

[33]  arXiv:2301.10687 (cross-list from eess.IV) [pdf, other]
Title: Self-Supervised Curricular Deep Learning for Chest X-Ray Image Classification
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Deep learning technologies have already demonstrated a high potential to build diagnosis support systems from medical imaging data, such as Chest X-Ray images. However, the shortage of labeled data in the medical field represents one key obstacle to narrowing the performance gap with respect to applications in other image domains. In this work, we investigate the benefits of a curricular Self-Supervised Learning (SSL) pretraining scheme with respect to fully-supervised training regimes for pneumonia recognition on Chest X-Ray images of Covid-19 patients. We show that curricular SSL pretraining, which leverages unlabeled data, outperforms models trained from scratch or pretrained on ImageNet, indicating the potential of performance gains by SSL pretraining on massive unlabeled datasets. Finally, we demonstrate that top-performing SSL-pretrained models show a higher degree of attention in the lung regions, embodying models that may be more robust to possible external confounding factors in the training datasets, identified by previous works.

Replacements for Thu, 26 Jan 23

[34]  arXiv:2102.04780 (replaced) [pdf, other]
Title: Diverse Single Image Generation with Controllable Global Structure
Comments: Published in the Neurocomputing Journal
Journal-ref: Neurocomputing 528(2023)97-112
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[35]  arXiv:2108.00596 (replaced) [pdf, other]
Title: GTNet:Guided Transformer Network for Detecting Human-Object Interactions
Comments: accepted for presentation in Pattern Recognition and Tracking XXXIV at SPIE commerce+ defence Program
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[36]  arXiv:2111.15363 (replaced) [pdf, other]
Title: Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding
Comments: Accepted at ICLR 2023. The code is available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[37]  arXiv:2205.03777 (replaced) [pdf, other]
Title: Semi-Cycled Generative Adversarial Networks for Real-World Face Super-Resolution
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[38]  arXiv:2206.04176 (replaced) [pdf, other]
Title: VN-Transformer: Rotation-Equivariant Attention for Vector Neurons
Comments: Published in Transactions on Machine Learning Research (TMLR), 2023; Previous version appeared in Workshop on Machine Learning for Autonomous Driving, Conference on Neural Information Processing Systems (NeurIPS), 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
[39]  arXiv:2206.08171 (replaced) [pdf, other]
Title: K-Radar: 4D Radar Object Detection for Autonomous Driving in Various Weather Conditions
Comments: Accepted at NeurIPS 2022 Datasets and Benchmarks Track
Journal-ref: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks 2022)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[40]  arXiv:2206.11215 (replaced) [pdf, other]
Title: Certifiable 3D Object Pose Estimation: Foundations, Learning Models, and Self-Training
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
[41]  arXiv:2210.08159 (replaced) [pdf, other]
Title: Dynamics-aware Adversarial Attack of Adaptive Neural Networks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[42]  arXiv:2212.00214 (replaced) [pdf, other]
Title: Test-Time Mixup Augmentation for Data and Class-Dependent Uncertainty Estimation in Deep Learning Image Classification
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[43]  arXiv:2210.00312 (replaced) [pdf, other]
Title: Multimodal Analogical Reasoning over Knowledge Graphs
Comments: Accepted by ICLR 2023
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)