Image and Video Processing

New submissions

New submissions for Mon, 18 Oct 21

[1]  arXiv:2110.07610 [pdf]
Title: Non-contact Atrial Fibrillation Detection from Face Videos by Learning Systolic Peaks
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Objective: We propose a non-contact approach for atrial fibrillation (AF) detection from face videos. Methods: Face videos, electrocardiography (ECG), and contact photoplethysmography (PPG) from 100 healthy subjects and 100 AF patients are recorded. All the videos in the healthy group are labeled as healthy. Videos in the patient group are labeled as AF, sinus rhythm (SR), or atrial flutter (AFL) by cardiologists. We use a 3D convolutional neural network for remote PPG measurement and propose a novel loss function (Wasserstein distance) to use the timing of systolic peaks from contact PPG as the label for our model training. Then a set of heart rate variability (HRV) features is calculated from the inter-beat intervals, and a support vector machine (SVM) classifier is trained with the HRV features. Results: Our proposed method can accurately extract systolic peaks from face videos for AF detection. The proposed method is trained with subject-independent 10-fold cross-validation with 30s video clips and tested on two tasks. 1) Classification of healthy versus AF: the accuracy, sensitivity, and specificity are 96.16%, 95.71%, and 96.23%. 2) Classification of SR versus AF: the accuracy, sensitivity, and specificity are 95.31%, 98.66%, and 91.11%. Conclusion: We achieve good performance of non-contact AF detection by learning systolic peaks. Significance: Non-contact AF detection can be used for self-screening of AF symptoms in susceptible populations at home, or for self-monitoring of AF recurrence after treatment in chronic patients.
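
As a rough illustration of the step from systolic peak times to HRV features, the sketch below computes inter-beat intervals and a few common time-domain HRV statistics. The abstract does not say which HRV features the authors use, so the choice of SDNN, RMSSD, and pNN50, as well as the function names, are assumptions made for illustration only.

```python
import math


def inter_beat_intervals(peak_times_s):
    """Differences between consecutive systolic peak times, in seconds."""
    return [b - a for a, b in zip(peak_times_s, peak_times_s[1:])]


def hrv_features(ibis_s):
    """Common time-domain HRV statistics (an assumed feature set).

    mean_ibi: mean inter-beat interval
    sdnn:     sample standard deviation of the intervals
    rmssd:    root mean square of successive interval differences
    pnn50:    fraction of successive differences larger than 50 ms
    """
    n = len(ibis_s)
    if n < 2:
        raise ValueError("need at least two inter-beat intervals")
    mean_ibi = sum(ibis_s) / n
    sdnn = math.sqrt(sum((x - mean_ibi) ** 2 for x in ibis_s) / (n - 1))
    diffs = [b - a for a, b in zip(ibis_s, ibis_s[1:])]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    pnn50 = sum(1 for d in diffs if abs(d) > 0.05) / len(diffs)
    return {"mean_ibi": mean_ibi, "sdnn": sdnn, "rmssd": rmssd, "pnn50": pnn50}


if __name__ == "__main__":
    # Hypothetical peak times (seconds) with an irregular rhythm.
    peaks = [0.0, 0.8, 2.0, 2.8]
    print(hrv_features(inter_beat_intervals(peaks)))
```

A feature dictionary like this one could then be turned into a vector and passed to any standard SVM implementation; that step is omitted here.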

[2]  arXiv:2110.07711 [pdf, other]
Title: Gray Matter Segmentation in Ultra High Resolution 7 Tesla ex vivo T2w MRI of Human Brain Hemispheres
Comments: Submitted to IEEE International Symposium on Biomedical Imaging (ISBI) 2022
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Ex vivo MRI of the brain provides remarkable advantages over in vivo MRI for visualizing and characterizing detailed neuroanatomy. However, automated cortical segmentation methods in ex vivo MRI are not well developed, primarily due to limited availability of labeled datasets, and heterogeneity in scanner hardware and acquisition protocols. In this work, we present a high resolution 7 Tesla dataset of 32 ex vivo human brain specimens. We benchmark the cortical mantle segmentation performance of nine neural network architectures, trained and evaluated using manually-segmented 3D patches sampled from specific cortical regions, and show excellent generalizing capabilities across whole brain hemispheres in different specimens, and also on unseen images acquired at different magnetic field strength and imaging sequences. Finally, we provide cortical thickness measurements across key regions in 3D ex vivo human brain images. Our code and processed datasets are publicly available at https://github.com/Pulkit-Khandelwal/picsl-ex-vivo-segmentation.

[3]  arXiv:2110.07745 [pdf, other]
Title: A deep learning model for classification of diabetic retinopathy in eye fundus images based on retinal lesion detection
Comments: 7 pages and 1 figure
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Diabetic retinopathy (DR) is the result of a complication of diabetes affecting the retina. It can cause blindness, if left undiagnosed and untreated. An ophthalmologist performs the diagnosis by screening each patient and analyzing the retinal lesions via ocular imaging. In practice, such analysis is time-consuming and cumbersome to perform. This paper presents a model for automatic DR classification on eye fundus images. The approach identifies the main ocular lesions related to DR and subsequently diagnoses the illness. The proposed method follows the same workflow as the clinicians, providing information that can be interpreted clinically to support the prediction. A subset of the kaggle EyePACS and the Messidor-2 datasets, labeled with ocular lesions, is made publicly available. The kaggle EyePACS subset is used as a training set and the Messidor-2 as a test set for lesions and DR classification models. For DR diagnosis, our model has an area-under-the-curve, sensitivity, and specificity of 0.948, 0.886, and 0.875, respectively, which competes with state-of-the-art approaches.

[4]  arXiv:2110.07768 [pdf, other]
Title: Application of Homomorphic Encryption in Medical Imaging
Subjects: Image and Video Processing (eess.IV); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

In this technical report, we explore the use of homomorphic encryption (HE) in the context of training and predicting with deep learning (DL) models to deliver strict Privacy by Design services, and to enforce a zero-trust model of data governance. First, we show how HE can be used to make predictions over medical images while preventing unauthorized secondary use of data, and detail our results on a disease classification task with OCT images. Then, we demonstrate that HE can be used to secure the training of DL models through federated learning, and report some experiments using 3D chest CT-Scans for a nodule detection task.

[5]  arXiv:2110.07878 [pdf, other]
Title: Single volume lung biomechanics from chest computed tomography using a mode preserving generative adversarial network
Comments: 5 pages, 5 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Local tissue expansion of the lungs is typically derived by registering computed tomography (CT) scans acquired at multiple lung volumes. However, acquiring multiple scans incurs increased radiation dose, time, and cost, and may not be possible in many cases, thus restricting the applicability of registration-based biomechanics. We propose a generative adversarial learning approach for estimating local tissue expansion directly from a single CT scan. The proposed framework was trained and evaluated on 2500 subjects from the SPIROMICS cohort. Once trained, the framework can be used as a registration-free method for predicting local tissue expansion. We evaluated model performance across varying degrees of disease severity and compared its performance with two image-to-image translation frameworks: UNet and Pix2Pix. Our model achieved an overall PSNR of 18.95 decibels, SSIM of 0.840, and Spearman's correlation of 0.61 at a high spatial resolution of 1 mm³.
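
For readers unfamiliar with the PSNR figure quoted above, a minimal sketch of the standard definition, PSNR = 10 log10(R^2 / MSE), follows. The function name, the flat-list input, and the `data_range` parameter R are assumptions; the abstract does not describe how the authors normalise their images or which implementation they use.

```python
import math


def psnr(reference, estimate, data_range):
    """Peak signal-to-noise ratio in decibels for two equal-length sequences.

    data_range is the assumed span of valid values (for example 255 for
    8-bit images). Returns infinity when the inputs are identical.
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


if __name__ == "__main__":
    # Hypothetical flattened voxel values.
    print(psnr([0, 0, 0, 0], [1, 1, 1, 1], data_range=10))
```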

[6]  arXiv:2110.07919 [pdf, other]
Title: Combining CNNs With Transformer for Multimodal 3D MRI Brain Tumor Segmentation With Self-Supervised Pretraining
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

We apply an ensemble of modified TransBTS, nnU-Net, and a combination of both for the segmentation task of the BraTS 2021 challenge. We change the original architecture of the TransBTS model by adding Squeeze-and-Excitation blocks, increasing the number of CNN layers, and replacing the positional encoding in the Transformer block with learnable Multilayer Perceptron (MLP) embeddings, which makes the Transformer adjustable to any input size during inference. With these modifications, we are able to largely improve TransBTS performance. Inspired by the nnU-Net framework, we decided to combine it with our modified TransBTS by changing the architecture inside nnU-Net to our custom model. On the validation set of BraTS 2021, the ensemble of these approaches achieves 0.8496, 0.8698, 0.9256 Dice score and 15.72, 11.057, 3.374 HD95 for enhancing tumor, tumor core, and whole tumor, respectively. Our code is publicly available.
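
The Dice score reported above measures overlap between a predicted and a reference mask: Dice = 2|A ∩ B| / (|A| + |B|). The sketch below shows that formula on flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions, and this is not the BraTS evaluation code.

```python
def dice_score(pred_mask, true_mask):
    """Dice overlap for two equal-length binary masks (values 0 or 1)."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    # Hypothetical flattened masks for one tumor sub-region.
    print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))
```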

[7]  arXiv:2110.08048 [pdf, other]
Title: Multi-Layer Pseudo-Supervision for Histopathology Tissue Semantic Segmentation using Patch-level Classification Labels
Comments: 15 pages, 10 figures, journal
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Tissue-level semantic segmentation is a vital step in computational pathology. Fully-supervised models have already achieved outstanding performance with dense pixel-level annotations. However, drawing such labels on giga-pixel whole slide images is extremely expensive and time-consuming. In this paper, we use only patch-level classification labels to achieve tissue semantic segmentation on histopathology images, thereby reducing the annotation effort. We propose a two-step model comprising a classification phase and a segmentation phase. In the classification phase, we propose a CAM-based model to generate pseudo masks from patch-level labels. In the segmentation phase, we achieve tissue semantic segmentation with our proposed Multi-Layer Pseudo-Supervision. Several technical novelties are proposed to reduce the information gap between pixel-level and patch-level annotations. As part of this paper, we introduce a new weakly-supervised semantic segmentation (WSSS) dataset for lung adenocarcinoma (LUAD-HistoSeg). We conducted several experiments to evaluate our proposed model on two datasets. Our proposed model outperforms two state-of-the-art WSSS approaches. Note that we achieve quantitative and qualitative results comparable with the fully-supervised model, with only around a 2% gap for MIoU and FwIoU. Compared with manual labeling, our model can reduce the annotation time from hours to minutes. The source code is available at: https://github.com/ChuHan89/WSSS-Tissue.

[8]  arXiv:2110.08080 [pdf, other]
Title: Multi-modal Aggregation Network for Fast MR Imaging
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Magnetic resonance (MR) imaging is a commonly used scanning technique for disease detection, diagnosis and treatment monitoring. Although it is able to produce detailed images of organs and tissues with better contrast, it suffers from a long acquisition time, which makes the image quality vulnerable to, for example, motion artifacts. Recently, many approaches have been developed to reconstruct full-sampled images from partially observed measurements in order to accelerate MR imaging. However, most of these efforts focus on reconstruction over a single modality or simple fusion of multiple modalities, neglecting the discovery of correlation knowledge at different feature levels. In this work, we propose a novel Multi-modal Aggregation Network, named MANet, which is capable of discovering complementary representations from a fully sampled auxiliary modality, with which to hierarchically guide the reconstruction of a given target modality. In our MANet, the representations from the fully sampled auxiliary and undersampled target modalities are learned independently through a specific network. Then, a guided attention module is introduced in each convolutional stage to selectively aggregate multi-modal features for better reconstruction, yielding comprehensive, multi-scale, multi-modal feature fusion. Moreover, our MANet follows a hybrid domain learning framework, which allows it to simultaneously recover the frequency signal in the $k$-space domain as well as restore the image details from the image domain. Extensive experiments demonstrate the superiority of the proposed approach over state-of-the-art MR image reconstruction methods.

[9]  arXiv:2110.08085 [pdf]
Title: Prediction of Lung CT Scores of Systemic Sclerosis by Cascaded Regression Neural Networks
Comments: SPIE 2022 accepted
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Visually scoring lung involvement in systemic sclerosis from CT scans plays an important role in monitoring progression, but its labor intensiveness hinders practical application. We proposed, therefore, an automatic scoring framework that consists of two cascaded deep regression neural networks. The first (3D) network aims to predict the craniocaudal position of five anatomically defined scoring levels on the 3D CT scans. The second (2D) network receives the resulting 2D axial slices and predicts the scores. We used 227 3D CT scans to train and validate the first network, and the resulting 1135 axial slices were used in the second network. Two experts independently scored a subset of the data to obtain intra- and interobserver variabilities, and the ground truth for all data was obtained in consensus. To alleviate the imbalance in training labels in the second network, we introduced a sampling technique, and to increase the diversity of the training samples, synthetic data was generated, mimicking ground glass and reticulation patterns. The 4-fold cross validation showed that our proposed network achieved an average MAE of 5.90, 4.66 and 4.49, and a weighted kappa of 0.66, 0.58 and 0.65 for total score (TOT), ground glass (GG) and reticular pattern (RET), respectively. Our network performed slightly worse than the best experts on TOT and GG prediction, but it has competitive performance on RET prediction and has the potential to be an objective alternative for the visual scoring of SSc in CT thorax studies.

Cross-lists for Mon, 18 Oct 21

[10]  arXiv:2110.07608 (cross-list from q-bio.QM) [pdf, other]
Title: 3D Structure from 2D Microscopy images using Deep Learning
Comments: 32 Pages, 12 figures. Awaiting publication in 'Frontiers in Bioinformatics - Computational Bioimaging' - this https URL
Subjects: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Understanding the structure of a protein complex is crucial in determining its function. However, retrieving accurate 3D structures from microscopy images is highly challenging, particularly as many imaging modalities are two-dimensional. Recent advances in Artificial Intelligence have been applied to this problem, primarily using voxel based approaches to analyse sets of electron microscopy images. Here we present a deep learning solution for reconstructing the protein complexes from a number of 2D single molecule localization microscopy images, with the solution being completely unconstrained. Our convolutional neural network coupled with a differentiable renderer predicts pose and derives a single structure. After training, the network is discarded, with the output of this method being a structural model which fits the data-set. We demonstrate the performance of our system on two protein complexes: CEP152 (which comprises part of the proximal toroid of the centriole) and centrioles.

[11]  arXiv:2110.07646 (cross-list from cs.CV) [pdf, other]
Title: Talking Detection In Collaborative Learning Environments
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

We study the problem of detecting talking activities in collaborative learning videos. Our approach uses head detection and projections of the log-magnitude of optical flow vectors to reduce the problem to a simple classification of small projection images without the need for training complex, 3-D activity classification systems. The small projection images are then easily classified using a simple majority vote of standard classifiers. For talking detection, our proposed approach is shown to significantly outperform single activity systems. We have an overall accuracy of 59% compared to 42% for Temporal Segment Network (TSN) and 45% for Convolutional 3D (C3D). In addition, our method is able to detect multiple talking instances from multiple speakers, while also detecting the speakers themselves.
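
The "simple majority vote of standard classifiers" mentioned above can be sketched as follows for the binary talking versus not-talking case. The function name, the 0/1 label encoding, and the rule that a tie counts as not talking are assumptions; the abstract does not state how the authors break ties or which classifiers they combine.

```python
def majority_vote(votes):
    """Combine binary classifier outputs (1 = talking, 0 = not talking).

    Returns 1 only when strictly more than half of the votes are 1;
    ties are resolved to 0 (an assumed convention).
    """
    if not votes:
        raise ValueError("need at least one vote")
    return 1 if sum(votes) * 2 > len(votes) else 0


if __name__ == "__main__":
    # Hypothetical outputs of three classifiers for one projection image.
    print(majority_vote([1, 1, 0]))
```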

[12]  arXiv:2110.07661 (cross-list from cs.LG) [pdf, other]
Title: Distribution-Free Federated Learning with Conformal Predictions
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Federated learning has attracted considerable interest for collaborative machine learning in healthcare to leverage separate institutional datasets while maintaining patient privacy.
However, additional challenges such as poor calibration and lack of interpretability may also hamper widespread deployment of federated models into clinical practice and lead to user distrust or misuse of ML tools in high-stakes clinical decision-making.
In this paper, we propose to address these challenges by incorporating an adaptive conformal framework into federated learning to ensure distribution-free prediction sets that provide coverage guarantees and uncertainty estimates without requiring any additional modifications to the model or assumptions.
Empirical results on the MedMNIST medical imaging benchmark demonstrate that our federated method provides tighter coverage with lower average cardinality than local conformal predictions on 6 different medical imaging benchmark datasets in 2D and 3D multi-class classification tasks.
Further, we correlate class entropy and prediction set size to assess task uncertainty with conformal methods.
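
To make the idea of a distribution-free prediction set concrete, here is a minimal sketch of plain split conformal prediction for classification. It is not the adaptive or federated procedure the abstract describes: the nonconformity score (one minus the probability assigned to the true class), the function names, and the calibration data are all assumptions chosen for simplicity.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of calibration nonconformity scores.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest score; if that rank
    exceeds n, the threshold is infinite and every class is included.
    """
    n = len(cal_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, qhat):
    """Indices of classes whose score 1 - p does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1 - p <= qhat]


if __name__ == "__main__":
    # Hypothetical calibration scores: 1 - p(true class) on held-out data.
    cal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    qhat = conformal_threshold(cal, alpha=0.5)
    print(prediction_set([0.7, 0.25, 0.05], qhat))
```

The size of the returned set is the "cardinality" referred to in the abstract: a smaller set at the same coverage level indicates a more informative prediction.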

[13]  arXiv:2110.07716 (cross-list from cs.CV) [pdf]
Title: Adversarial Scene Reconstruction and Object Detection System for Assisting Autonomous Vehicle
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

In the current computer vision era, classifying scenes through video surveillance systems is a crucial task. Artificial Intelligence (AI) video surveillance technologies have advanced remarkably as artificial intelligence and deep learning have been incorporated into these systems. Deep learning visual classification methods have achieved high accuracy in classifying visual scenes. However, visual classifiers have difficulty examining scenes in dark areas, especially at nighttime. The classifiers also have difficulty identifying the contexts of the scenes. This paper proposes a deep learning model that reconstructs dark visual scenes into clear, daylight-like scenes and recognizes visual actions for the autonomous vehicle. The proposed model achieved 87.3 percent accuracy for scene reconstruction and 89.2 percent in scene understanding and detection tasks.

[14]  arXiv:2110.07728 (cross-list from q-bio.QM) [pdf, other]
Title: Pre-training Molecular Graph Representation with 3D Geometry
Subjects: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Molecular graph representation learning is a fundamental problem in modern drug and material discovery. Molecular graphs are typically modeled by their 2D topological structures, but it has been recently discovered that 3D geometric information plays a more vital role in predicting molecular functionalities. However, the lack of 3D information in real-world scenarios has significantly impeded the learning of geometric graph representation. To cope with this challenge, we propose the Graph Multi-View Pre-training (GraphMVP) framework where self-supervised learning (SSL) is performed by leveraging the correspondence and consistency between 2D topological structures and 3D geometric views. GraphMVP effectively learns a 2D molecular graph encoder that is enhanced by richer and more discriminative 3D geometry. We further provide theoretical insights to justify the effectiveness of GraphMVP. Finally, comprehensive experiments show that GraphMVP can consistently outperform existing graph SSL methods.

[15]  arXiv:2110.08243 (cross-list from eess.AS) [pdf, other]
Title: Neural Dubber: Dubbing for Silent Videos According to Scripts
Comments: Accepted by NeurIPS 2021
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Image and Video Processing (eess.IV)

Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in synchronization with the pre-recorded videos. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with the given silent video from the text. Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the chemistry lecture single-speaker dataset and LRS2 multi-speaker dataset show that Neural Dubber can generate speech audios on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.

Replacements for Mon, 18 Oct 21

[16]  arXiv:2010.01663 (replaced) [pdf, other]
Title: KiU-Net: Overcomplete Convolutional Architectures for Biomedical Image and Volumetric Segmentation
Comments: Journal Extension of KiU-Net (MICCAI-2020)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[17]  arXiv:2101.06116 (replaced) [pdf, other]
Title: Hyperspectral Image Classification -- Traditional to Deep Models: A Survey for Future Prospects
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[18]  arXiv:2103.09404 (replaced) [pdf, other]
Title: Collapsible Linear Blocks for Super-Efficient Super Resolution
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[19]  arXiv:2109.10902 (replaced) [pdf, other]
Title: Mixed-supervised segmentation: Confidence maximization helps knowledge distillation
Comments: This article is a journal extension of our paper in IPMI 2021 arXiv:2012.08051 . Currently under review at Medical Image Analysis. Code available at this https URL
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[20]  arXiv:2108.10335 (replaced) [pdf, other]
Title: edge-SR: Super-Resolution For The Masses
Comments: In WACV 2022. Code available in this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
[21]  arXiv:2110.00076 (replaced) [pdf, other]
Title: Quantitative Jones matrix imaging using vectorial Fourier ptychography
Subjects: Optics (physics.optics); Image and Video Processing (eess.IV); Biological Physics (physics.bio-ph)
[22]  arXiv:2110.01660 (replaced) [pdf, other]
Title: HDR-cGAN: Single LDR to HDR Image Translation using Conditional GAN
Comments: Accepted in ICVGIP 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[23]  arXiv:2110.06284 (replaced) [pdf, other]
Title: Tomographic phase and attenuation extraction for a sample composed of unknown materials using X-ray propagation-based phase-contrast imaging
Comments: 8 pages, 4 figures and 1 table
Subjects: Medical Physics (physics.med-ph); Image and Video Processing (eess.IV); Optics (physics.optics)