Electrical Engineering and Systems Science

New submissions

[ total of 109 entries: 1-109 ]

New submissions for Wed, 16 Jun 21

[1]  arXiv:2106.07645 [pdf]
Title: PhyMask: Robust Sensing of Brain Activity and Physiological Signals During Sleep with an All-textile Eye Mask
Subjects: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC)

Clinical-grade wearable sleep monitoring is a challenging problem since it requires concurrently monitoring brain activity, eye movement, muscle activity, cardio-respiratory features and gross body movements. This requires multiple sensors to be worn at different locations as well as uncomfortable adhesives and discrete electronic components to be placed on the head. As a result, existing wearables either compromise comfort or compromise accuracy in tracking sleep variables. We propose PhyMask, an all-textile sleep monitoring solution that is practical and comfortable for continuous use and that acquires all signals of interest to sleep solely using comfortable textile sensors placed on the head. We show that PhyMask can be used to accurately measure sleep stages and advanced sleep markers such as spindles and k-complexes robustly in the real-world setting. We validate PhyMask against polysomnography and show that it significantly outperforms two commercially-available sleep tracking wearables, Fitbit and Oura Ring.

[2]  arXiv:2106.07737 [pdf]
Title: Optical Wireless Satellite Networks versus Optical Fiber Terrestrial Networks: The Latency Perspective
Comments: Accepted for publication in proceedings of 2021 30th Biennial Symposium on Communications (BSC 2021)
Subjects: Signal Processing (eess.SP)

Formed by using laser inter-satellite links (LISLs) among satellites in upcoming low Earth orbit and very low Earth orbit satellite constellations, optical wireless satellite networks (OWSNs), also known as free-space optical satellite networks, can provide a better alternative to existing optical fiber terrestrial networks (OFTNs) for long-distance inter-continental data communications. The LISLs operate at the speed of light in vacuum in space, which gives OWSNs a crucial advantage over OFTNs in terms of latency. In this paper, we employ the satellite constellation for Phase I of Starlink and LISLs between satellites to simulate an OWSN. Then, we compare the network latency of this OWSN and the OFTN under three different scenarios for long-distance inter-continental data communications. The results show that the OWSN performs better than the OFTN in all scenarios. It is observed that the longer the length of the inter-continental connection between the source and the destination, the better the latency improvement offered by the OWSN compared to OFTN.
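The latency argument in the abstract above rests on propagation speed: light travels at c in vacuum but at roughly c/n in glass fiber. The following is a minimal sketch of that arithmetic only. The refractive index of 1.47 and both path lengths are assumptions chosen for illustration and are not taken from the paper, and the function name is hypothetical.

```python
# Speed of light in vacuum in m/s (exact by definition of the metre).
C_VACUUM_M_PER_S = 299_792_458.0

# ASSUMPTION: a typical refractive index for silica fiber, not a value from the paper.
FIBER_REFRACTIVE_INDEX = 1.47


def propagation_delay_ms(path_km: float, refractive_index: float = 1.0) -> float:
    """One-way propagation delay in milliseconds for a path of the given length.

    A refractive index of 1.0 models free space; larger values slow the signal.
    """
    if path_km < 0 or refractive_index < 1.0:
        raise ValueError("path_km must be >= 0 and refractive_index must be >= 1")
    speed_m_per_s = C_VACUUM_M_PER_S / refractive_index
    seconds = path_km * 1_000.0 / speed_m_per_s
    return seconds * 1_000.0


if __name__ == "__main__":
    # ASSUMPTION: hypothetical path lengths. The satellite path is made longer
    # on purpose to account for the climb to orbit and the descent.
    fiber_ms = propagation_delay_ms(10_000.0, FIBER_REFRACTIVE_INDEX)
    satellite_ms = propagation_delay_ms(11_000.0)
    print(f"fiber:     {fiber_ms:.2f} ms")
    print(f"satellite: {satellite_ms:.2f} ms")
```

Under these assumptions the free-space path can be up to 47% longer than the fiber path before its propagation delay catches up, which is consistent with the abstract's observation that the advantage grows with distance. Switching, queuing, and routing delays are ignored here.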

[3]  arXiv:2106.07759 [pdf, ps, other]
Title: Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition
Comments: 4 figures, 7 pages; fixed author list going out of margin
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)

In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised training. The proposed approach uses a teacher model which is updated as the exponential moving average of the student model parameters. This can be seen as a continuous version of the iterative pseudo-labeling approach for semi-supervised training. It is applicable to different training criteria, and in this paper we demonstrate it for frame-level hybrid hidden Markov model - deep neural network (HMM-DNN) models and sequence-level connectionist temporal classification (CTC) based models. The proposed approach shows more than 10% word error rate (WER) reduction over standard teacher-student training and more than 50% relative WER reduction over a 10-hour supervised baseline when using large-scale realistic unsupervised public videos in UK English and Italian.
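The teacher update described above is an exponential moving average (EMA) of the student parameters. A minimal sketch of one such update step follows. It is not the authors' implementation: the parameter container, the function name `ema_update`, and the decay value in the usage note are assumptions for illustration, and the update schedule used in the paper is not reproduced.

```python
from typing import Dict, List

# ASSUMPTION: parameters are stored as named lists of floats for illustration.
Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, decay: float) -> Params:
    """Return new teacher parameters after one EMA step toward the student.

    Each teacher value becomes decay * teacher + (1 - decay) * student.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return {
        name: [
            decay * t + (1.0 - decay) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }


if __name__ == "__main__":
    teacher = {"w": [0.0, 2.0]}
    student = {"w": [1.0, 2.0]}
    # ASSUMPTION: decay chosen only for the example.
    print(ema_update(teacher, student, decay=0.9))
```

A decay close to 1 makes the teacher change slowly, so the pseudo-labels it produces are more stable than the student's own predictions from step to step.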

[4]  arXiv:2106.07806 [pdf, other]
Title: Highdicom: A Python library for standardized encoding of image annotations and machine learning model outputs in pathology and radiology
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Machine learning is revolutionizing image-based diagnostics in pathology and radiology. ML models have shown promising results in research settings, but their lack of interoperability has been a major barrier for clinical integration and evaluation. The DICOM standard specifies Information Object Definitions and Services for the representation and communication of digital images and related information, including image-derived annotations and analysis results. However, the complexity of the standard represents an obstacle for its adoption in the ML community and creates a need for software libraries and tools that simplify working with data sets in DICOM format. Here we present the highdicom library, which provides a high-level application programming interface for the Python programming language that abstracts low-level details of the standard and enables encoding and decoding of image-derived information in DICOM format in a few lines of Python code. The highdicom library ties into the extensive Python ecosystem for image processing and machine learning. Simultaneously, by simplifying creation and parsing of DICOM-compliant files, highdicom achieves interoperability with the medical imaging systems that hold the data used to train and run ML models, and ultimately communicate and store model outputs for clinical use. We demonstrate through experiments with slide microscopy and computed tomography imaging that, by bridging these two ecosystems, highdicom enables developers to train and evaluate state-of-the-art ML models in pathology and radiology while remaining compliant with the DICOM standard and interoperable with clinical systems at all stages. To promote standardization of ML research and streamline the ML model development and deployment process, we made the library available free and open-source.

[5]  arXiv:2106.07879 [pdf, ps, other]
Title: A Lightweight ReLU-Based Feature Fusion for Aerial Scene Classification
Comments: To be presented in IEEE ICIP'21
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

In this paper, we propose a transfer-learning based model construction technique for the aerial scene classification problem. The core of our technique is a layer selection strategy, named ReLU-Based Feature Fusion (RBFF), that extracts feature maps from a pretrained CNN-based single-object image classification model, namely MobileNetV2, and constructs a model for the aerial scene classification task. RBFF stacks features extracted from the batch normalization layer of a few selected blocks of MobileNetV2, where the candidate blocks are selected based on the characteristics of the ReLU activation layers present in those blocks. The feature vector is then compressed into a low-dimensional feature space using dimension reduction algorithms, on which we train a low-cost SVM classifier for the classification of the aerial images. We validate our choice of selected features based on the significance of the extracted features with respect to our classification pipeline. Remarkably, RBFF does not involve any training of the base CNN model except for a few parameters for the classifier, which makes the technique very cost-effective for practical deployments. The constructed model, despite being lightweight, outperforms several recently proposed models in terms of accuracy on a number of aerial scene datasets.

[6]  arXiv:2106.07889 [pdf, other]
Title: UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation
Comments: Accepted to INTERSPEECH 2021
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

Most neural vocoders employ band-limited mel-spectrograms to generate waveforms. If full-band spectral features are used as the input, the vocoder can be provided with as much acoustic information as possible. However, in some models employing full-band mel-spectrograms, an over-smoothing problem occurs as part of which non-sharp spectrograms are generated. To address this problem, we propose UnivNet, a neural vocoder that synthesizes high-fidelity waveforms in real time. Inspired by works in the field of voice activity detection, we added a multi-resolution spectrogram discriminator that employs multiple linear spectrogram magnitudes computed using various parameter sets. Using full-band mel-spectrograms as input, we expect to generate high-resolution signals by adding a discriminator that employs spectrograms of multiple resolutions as the input. In an evaluation on a dataset containing information on hundreds of speakers, UnivNet obtained the best objective and subjective results among competing models for both seen and unseen speakers. These results, including the best subjective score for text-to-speech, demonstrate the potential for fast adaptation to new speakers without a need for training from scratch.

[7]  arXiv:2106.07910 [pdf, other]
Title: Wavelength-based Attributed Deep Neural Network for Underwater Image Restoration
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Underwater images, in general, suffer from low contrast and high color distortions due to the non-uniform attenuation of the light as it propagates through the water. In addition, the degree of attenuation varies with the wavelength resulting in the asymmetric traversing of colors. Despite the prolific works for underwater image restoration (UIR) using deep learning, the above asymmetry has not been addressed in the respective network engineering. As the first novelty, this paper shows that attributing the right receptive field size (context) based on the traversing range of the color channel may lead to a substantial performance gain for the task of UIR. Further, it is important to suppress the irrelevant multi-contextual features and increase the representational power of the model. Therefore, as a second novelty, we have incorporated an attentive skip mechanism to adaptively refine the learned multi-contextual features. The proposed framework, called Deep WaveNet, is optimized using the traditional pixel-wise and feature-based cost functions. An extensive set of experiments has been carried out to show the efficacy of the proposed scheme over existing best-published literature on benchmark datasets. More importantly, we have demonstrated a comprehensive validation of enhanced images across various high-level vision tasks, e.g., underwater image semantic segmentation and diver's 2D pose estimation. A sample video to exhibit our real-world performance is available at https://www.youtube.com/watch?v=8qtuegBdfac.

[8]  arXiv:2106.07919 [pdf, other]
Title: A stochastic metapopulation state-space approach to modeling and estimating Covid-19 spread
Comments: 17 pages, 5 figures
Subjects: Signal Processing (eess.SP); Social and Information Networks (cs.SI); Populations and Evolution (q-bio.PE)

Mathematical models are widely recognized as an important tool for analyzing and understanding the dynamics of infectious disease outbreaks, predicting their future trends, and evaluating public health intervention measures for disease control and elimination. We propose a novel stochastic metapopulation state-space model for COVID-19 transmission, based on a discrete-time spatio-temporal susceptible/exposed/infected/recovered/deceased (SEIRD) model. The proposed framework allows the hidden SEIRD states and unknown transmission parameters to be estimated from noisy, incomplete time series of reported epidemiological data, by application of unscented Kalman filtering (UKF), maximum-likelihood adaptive filtering, and metaheuristic optimization. Experiments using both synthetic data and real data from the Fall 2020 COVID-19 wave in the state of Texas demonstrate the effectiveness of the proposed model.
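To make the discrete-time SEIRD structure mentioned above concrete, here is a minimal sketch of one update step for a single, well-mixed population. It is a simplification and not the paper's model: it is deterministic rather than stochastic, has no metapopulation coupling, and includes no Kalman filtering. The rate names (`beta`, `sigma`, `gamma`, `mu`) and all numeric values are assumptions for illustration.

```python
from typing import NamedTuple


class SEIRDState(NamedTuple):
    susceptible: float
    exposed: float
    infected: float
    recovered: float
    deceased: float


def seird_step(
    state: SEIRDState, beta: float, sigma: float, gamma: float, mu: float
) -> SEIRDState:
    """Advance a textbook single-population SEIRD model by one time step.

    beta: transmission rate, sigma: exposed-to-infected rate,
    gamma: recovery rate, mu: death rate (all per time step).
    """
    s, e, i, r, d = state
    living = s + e + i + r
    # New exposures scale with contacts between susceptible and infected people.
    new_exposed = beta * s * i / living if living > 0 else 0.0
    new_infected = sigma * e
    new_recovered = gamma * i
    new_deceased = mu * i
    return SEIRDState(
        susceptible=s - new_exposed,
        exposed=e + new_exposed - new_infected,
        infected=i + new_infected - new_recovered - new_deceased,
        recovered=r + new_recovered,
        deceased=d + new_deceased,
    )


if __name__ == "__main__":
    # ASSUMPTION: initial state and rates are illustrative, not fitted values.
    state = SEIRDState(990.0, 0.0, 10.0, 0.0, 0.0)
    for day in range(3):
        state = seird_step(state, beta=0.3, sigma=0.2, gamma=0.1, mu=0.01)
        print(day + 1, [round(x, 2) for x in state])
```

Every flow leaves one compartment and enters another, so the total across all five compartments stays constant. In a state-space formulation, a step like this would play the role of the state transition, with reported case counts as noisy observations.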

[9]  arXiv:2106.07939 [pdf, other]
Title: Attention-based distributed speech enhancement for unconstrained microphone arrays with varying number of nodes
Authors: Nicolas Furnon (MULTISPEECH), Romain Serizel (MULTISPEECH), Slim Essid (ADASP), Irina Illina (MULTISPEECH)
Journal-ref: European Signal Processing Conference (EUSIPCO), IEEE, Aug 2021, Dublin, Ireland
Subjects: Signal Processing (eess.SP); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Speech enhancement promises higher efficiency in ad-hoc microphone arrays than in constrained microphone arrays thanks to the wide spatial coverage of the devices in the acoustic scene. However, speech enhancement in ad-hoc microphone arrays still raises many challenges. In particular, the algorithms should be able to handle a variable number of microphones, as some devices in the array might appear or disappear. In this paper, we propose a solution that can efficiently process the spatial information captured by the different devices of the microphone array, while being robust to a link failure. To do this, we use an attention mechanism in order to put more weight on the relevant signals sent throughout the array and to neglect the redundant or empty channels.
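One common way to let an attention mechanism cope with a varying number of channels is to mask unavailable channels before the softmax, so they receive exactly zero weight. The sketch below illustrates only that general idea. It is not the architecture proposed in the paper, and the function name, the scores, and the availability flags are hypothetical.

```python
import math
from typing import List, Sequence


def masked_softmax(scores: Sequence[float], available: Sequence[bool]) -> List[float]:
    """Softmax over the available channels only; missing channels get weight 0."""
    if len(scores) != len(available):
        raise ValueError("scores and available must have the same length")
    active = [s for s, a in zip(scores, available) if a]
    if not active:
        raise ValueError("at least one channel must be available")
    peak = max(active)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, available)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # ASSUMPTION: three nodes, the third one has dropped out of the array.
    weights = masked_softmax([1.0, 1.0, 5.0], [True, True, False])
    print(weights)
```

Because the weights are renormalized over whatever channels remain, the same function works unchanged when a node appears or disappears.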

[10]  arXiv:2106.07953 [pdf, other]
Title: Learning to Compensate: A Deep Neural Network Framework for 5G Power Amplifier Compensation
Comments: IEEE International Conference on Communications (ICC) 2021
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)

Owing to the complicated characteristics of 5G communication systems, designing RF components through mathematical modeling becomes a challenging obstacle. Moreover, such mathematical models need numerous manual adjustments for various specification requirements. In this paper, we present a learning-based framework to model and compensate Power Amplifiers (PAs) in 5G communication. In the proposed framework, Deep Neural Networks (DNNs) are used to learn the characteristics of the PAs, while the corresponding Digital Pre-Distortions (DPDs) are also learned to compensate for the nonlinear and memory effects of PAs. On top of the framework, we further propose two frequency domain losses to guide the learning process to better optimize the target, compared to naive time domain Mean Square Error (MSE). The proposed framework serves as a drop-in replacement for the conventional approach. The proposed approach achieves an average of 56.7% reduction of nonlinear and memory effects, which converts to an average of 16.3% improvement over a carefully-designed mathematical model, and even reaches 34% enhancement in severe distortion scenarios.

[11]  arXiv:2106.07970 [pdf, ps, other]
Title: Jamming Detection With Subcarrier Blanking for 5G and Beyond in Industry 4.0 Scenarios
Comments: Accepted at the IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Sep. 2021
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)

Security attacks at the physical layer, in the form of radio jamming for denial of service, are an increasing threat in Industry 4.0 scenarios. In this paper, we consider the problem of jamming detection in 5G-and-beyond communication systems and propose a defense mechanism based on pseudo-random blanking of subcarriers with orthogonal frequency division multiplexing (OFDM). We then design a detector by applying the generalized likelihood ratio test (GLRT) on those subcarriers. We finally evaluate the performance of the proposed technique against a smart jammer, which is pursuing one of the following objectives: maximize stealthiness, minimize spectral efficiency (SE) with mobile broadband (MBB) type of traffic, and maximize block error rate (BLER) with ultra-reliable low-latency communications (URLLC). Numerical results show that a smart jammer a) needs to compromise between missed detection (MD) probability and SE reduction with MBB and b) can achieve low detectability and high system performance degradation with URLLC only if it has sufficiently high power.

[12]  arXiv:2106.07972 [pdf]
Title: SRIB Submission to Interspeech 2021 DiCOVA Challenge
Comments: 5 pages, 5 figures
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

The COVID-19 pandemic has resulted in more than 125 million infections and more than 2.7 million casualties. In this paper, we attempt to classify COVID vs. non-COVID cough sounds using signal processing and deep learning methods. Air turbulence, the vibration of tissues, movement of fluid through airways, and the opening and closure of the glottis are some of the causes for the production of the acoustic sound signals during cough. Does COVID-19 alter the acoustic characteristics of breath, cough, and speech sounds produced through the respiratory system? This is an open question waiting for answers. In this paper, we incorporated novel data augmentation methods for cough sound augmentation and multiple deep neural network architectures and methods along with handcrafted features. Our proposed system gives 14% absolute improvement in area under the curve (AUC). The proposed system is developed as part of Interspeech 2021 special sessions and challenges viz. diagnosing of COVID-19 using acoustics (DiCOVA). Our proposed method secured the 5th position on the leaderboard among 29 participants.

[13]  arXiv:2106.07977 [pdf, other]
Title: An Alternative Statistical Characterization of TWDP Fading Model
Comments: Submitted to IEEE Transactions on Wireless Communications
Subjects: Signal Processing (eess.SP)

Two-wave with diffuse power (TWDP) is one of the most promising models for description of small-scale fading effects in emerging wireless networks. However, its current statistical characterization has several fundamental issues. Primarily, conventional TWDP parameterization is not in accordance with the model's underlying physical mechanisms. In addition, available TWDP expressions for PDF, CDF, and MGF are given either in integral or approximate forms, or as mathematically intractable closed-form expressions. Consequently, the existing TWDP statistical characterization does not allow accurate evaluation of system performance (such as error and outage probability) in all fading conditions for most modulation and diversity techniques. In this paper, the existing statistical characterization of the TWDP fading model is improved by overcoming some of the noticed issues. In this regard, physically justified TWDP parameterization is proposed and used for further calculations. Additionally, exact infinite-series PDF and CDF are introduced. Based on these expressions, the exact MGF of the SNR is derived in a form suitable for mathematical manipulations. The applicability of the proposed MGF for derivation of the exact average symbol error probability (ASEP) is demonstrated with the example of M-ary PSK modulation. Therefore, in this paper, M-ary PSK ASEP is derived as an explicit expression for the first time in the literature. The derived expression is further simplified for large SNR values in order to obtain a closed-form asymptotic ASEP, which is shown to be applicable for SNR > 20 dB. All proposed expressions are verified by Monte Carlo simulation in a variety of TWDP fading conditions.

[14]  arXiv:2106.07985 [pdf]
Title: Impedance-optical Dual-modal Cell Culture Imaging with Learning-based Information Fusion
Subjects: Image and Video Processing (eess.IV); Medical Physics (physics.med-ph)

While Electrical Impedance Tomography (EIT) has found many biomedicine applications, a better resolution is needed to provide quantitative analysis for tissue engineering and regenerative medicine. This paper proposes an impedance-optical dual-modal imaging framework, which is mainly aimed at high-quality 3D cell culture imaging and can be extended to other tissue engineering applications. The framework comprises three components, i.e., an impedance-optical dual-modal sensor, the guidance image processing algorithm, and a deep learning model named multi-scale feature cross fusion network (MSFCF-Net) for information fusion. The MSFCF-Net has two inputs, i.e., the EIT measurement and a binary mask image generated by the guidance image processing algorithm, whose input is an RGB microscopic image. The network then effectively fuses the information from the two different imaging modalities and generates the final conductivity image. We assess the performance of the proposed dual-modal framework by numerical simulation and MCF-7 cell imaging experiments. The results show that the proposed method could significantly improve image quality, indicating that impedance-optical joint imaging has the potential to reveal the structural and functional information of tissue-level targets simultaneously.

[15]  arXiv:2106.07988 [pdf, other]
Title: Massive Wireless Energy Transfer with Statistical CSI Beamforming
Comments: Accepted for publication in the IEEE Journal of Selected Topics in Signal Processing, Special Issue on Signal Processing Advances in Wireless Transmission of Information and Power, to be published in Sep. 2021
Subjects: Signal Processing (eess.SP)

Wireless energy transfer (WET) is a promising solution to enable massive machine-type communications (mMTC) with low-complexity and low-powered wireless devices. Given the energy restrictions of the devices, instant channel state information at the transmitter (CSIT) is not expected to be available in practical WET-enabled mMTC. However, because it is common that the terminals appear spatially clustered, some degree of spatial correlation between their channels to the base station (BS) is expected to occur. The paper considers a massive antenna array at the BS for WET that only has access to i) the first and second order statistics of the Rician channel component of the multiple-input multiple-output (MIMO) channel and also to ii) the line-of-sight MIMO component. The optimal precoding scheme that maximizes the total energy available to the single-antenna devices is derived considering a continuous alphabet for the precoders, permitting any modulated or deterministic waveform. This may lead to some devices in the clusters being assigned a low fraction of the total available power in the cluster, creating a rather uneven situation among them. Consequently, a fairness criterion is introduced, imposing a minimum amount of power allocated to the terminals. A piece-wise linear harvesting circuit is considered at the terminals, with both saturation and a minimum sensitivity, and a constrained version of the precoder is also proposed by solving a non-linear programming problem. A paramount benefit of the constrained precoder is the encompassment of fairness in the power allocation to the different clusters. Moreover, given the polynomial complexity increase of the proposed unconstrained precoder, and the observed linear gain of the system's available sum-power with an increasing number of antennas at the ULA, the use of massive antenna arrays is desirable.

[16]  arXiv:2106.07994 [pdf, other]
Title: Multi-channel Opus compression for far-field automatic speech recognition with a fixed bitrate budget
Comments: Accepted at Interspeech 2021
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

Automatic speech recognition (ASR) in the cloud allows the use of larger models and more powerful multi-channel signal processing front-ends compared to on-device processing. However, it also adds an inherent latency due to the transmission of the audio signal, especially when transmitting multiple channels of a microphone array. One way to reduce the network bandwidth requirements is client-side compression with a lossy codec such as Opus. However, this compression can have a detrimental effect especially on multi-channel ASR front-ends, due to the distortion and loss of spatial information introduced by the codec. In this publication, we propose an improved approach for the compression of microphone array signals based on Opus, using a modified joint channel coding approach and additionally introducing a multi-channel spatial decorrelating transform to reduce redundancy in the transmission. We illustrate the effect of the proposed approach on the spatial information retained in multi-channel signals after compression, and evaluate the performance on far-field ASR with a multi-channel beamforming front-end. We demonstrate that our approach can lead to a 37.5 % bitrate reduction or a 5.1 % relative word error rate reduction for a fixed bitrate budget in a seven channel setup.

[17]  arXiv:2106.07996 [pdf, other]
Title: Over-the-Air Equalization with Reconfigurable Intelligent Surfaces
Subjects: Signal Processing (eess.SP)

Reconfigurable intelligent surface (RIS)-empowered communications is on the rise and is a promising technology envisioned to aid in 6G and beyond wireless communication networks. RISs can manipulate impinging waves through their electromagnetic elements, enabling some degree of control over the wireless channel. In this paper, the potential of RIS technology is explored to perform equalization over-the-air for frequency-selective channels, whereas equalization is generally conducted at either the transmitter or receiver in conventional communication systems. Specifically, with the aid of an RIS, the frequency-selective channel from the transmitter to the RIS is transformed to a frequency-flat channel through elimination of inter-symbol interference (ISI) components at the receiver. ISI is eliminated by adjusting the phases of impinging signals particularly to maximize the incoming signal of the strongest tap. First, a general end-to-end system model is provided and a continuous to discrete-time signal model is presented. Subsequently, a probabilistic analysis for the elimination of ISI terms is conducted and reinforced with computer simulations. Furthermore, a theoretical error probability analysis is performed along with computer simulations. It is demonstrated that with the proposed method, ISI can successfully be eliminated and the RIS-aided communication channel can be converted from frequency-selective to frequency-flat.

[18]  arXiv:2106.08008 [pdf, other]
Title: Towards Long-term Non-invasive Monitoring for Epilepsy via Wearable EEG Devices
Comments: 4 pages, 3 figures, 2 tables, pre-print
Subjects: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

We present the implementation of seizure detection algorithms based on a minimal number of EEG channels on a parallel ultra-low-power embedded platform. The analyses are based on the CHB-MIT dataset, and include explorations of different classification approaches (Support Vector Machines, Random Forest, Extra Trees, AdaBoost) and different pre/post-processing techniques to maximize sensitivity while guaranteeing no false alarms. We analyze global and subject-specific approaches, considering all 23 electrodes or only 4 temporal channels. For an 8 s window size and a subject-specific approach, we report zero false positives and 100% sensitivity. These algorithms are parallelized and optimized for a parallel ultra-low power (PULP) platform, enabling 300 h of continuous monitoring on a 300 mAh battery, in a wearable form factor and power budget. These results pave the way for the implementation of affordable, wearable, long-term epilepsy monitoring solutions with low false-positive rates and high sensitivity, meeting both patient and caregiver requirements.
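The battery figures quoted above imply an average current budget, which is simply capacity divided by runtime. The short sketch below works through that arithmetic. It assumes an ideal battery with no conversion losses or capacity derating, which is a simplification not stated in the abstract, and the function name is hypothetical.

```python
def average_current_ma(capacity_mah: float, runtime_h: float) -> float:
    """Average current draw in mA that drains the given capacity in the given time.

    ASSUMPTION: ideal battery, i.e. the full rated capacity is usable.
    """
    if runtime_h <= 0:
        raise ValueError("runtime_h must be positive")
    return capacity_mah / runtime_h


if __name__ == "__main__":
    # Figures from the abstract: 300 mAh battery, 300 h of monitoring.
    print(average_current_ma(300.0, 300.0), "mA average")
```

With the quoted figures the whole system, including acquisition and processing, would have to average about 1 mA under this idealized model.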

[19]  arXiv:2106.08094 [pdf, other]
Title: Cine-MRI detection of abdominal adhesions with spatio-temporal deep learning
Comments: Accepted at MIDL 2021 as short paper
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Adhesions are an important cause of chronic pain following abdominal surgery. Recent developments in abdominal cine-MRI have enabled the non-invasive diagnosis of adhesions. Adhesions are identified on cine-MRI by the absence of sliding motion during movement. Diagnosis and mapping of adhesions improves the management of patients with pain. Detection of abdominal adhesions on cine-MRI is challenging from both a radiological and deep learning perspective. We focus on classifying presence or absence of adhesions in sagittal abdominal cine-MRI series. We experimented with spatio-temporal deep learning architectures centered around a ConvGRU architecture. A hybrid architecture comprising a ResNet followed by a ConvGRU model allows classification of a whole time-series. Compared to a stand-alone ResNet with a two time-point (inspiration/expiration) input, we show an increase in classification performance (AUROC) from 0.74 to 0.83 ($p<0.05$). Our full temporal classification approach adds only a small amount (5%) of parameters to the entire architecture, which may be useful for other medical imaging problems with a temporal dimension.

[20]  arXiv:2106.08107 [pdf, other]
Title: ResDepth: A Deep Prior For 3D Reconstruction From High-resolution Satellite Images
Comments: Under review
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Modern optical satellite sensors enable high-resolution stereo reconstruction from space. But the challenging imaging conditions when observing the Earth from space push stereo matching to its limits. In practice, the resulting digital surface models (DSMs) are fairly noisy and often do not attain the accuracy needed for high-resolution applications such as 3D city modeling. Arguably, stereo correspondence based on low-level image similarity is insufficient and should be complemented with a-priori knowledge about the expected surface geometry beyond basic local smoothness. To that end, we introduce ResDepth, a convolutional neural network that learns such an expressive geometric prior from example data. ResDepth refines an initial, raw stereo DSM while conditioning the refinement on the images. That is, it acts as a smart, learned post-processing filter and can seamlessly complement any stereo matching pipeline. In a series of experiments, we find that the proposed method consistently improves stereo DSMs both quantitatively and qualitatively. We show that the prior encoded in the network weights captures meaningful geometric characteristics of urban design, which also generalize across different districts and even from one city to another. Moreover, we demonstrate that, by training on a variety of stereo pairs, ResDepth can acquire a sufficient degree of invariance against variations in imaging conditions and acquisition geometry.

[21]  arXiv:2106.08124 [pdf, other]
Title: Quality assessment methods for perceptual video compression
Subjects: Image and Video Processing (eess.IV)

This paper describes a quality assessment model for perceptual video compression applications (PVM), which simulates visual masking and distortion-artefact perception using an adaptive combination of noticeable distortions and blurring artefacts. The method shows significant improvement over existing quality metrics based on the VQEG database, and provides compatibility with in-loop rate-quality optimisation for next generation video codecs due to its latency and complexity attributes. Performance comparisons are validated against a range of different distortion types.

[22]  arXiv:2106.08126 [pdf, other]
Title: Dialectal Speech Recognition and Translation of Swiss German Speech to Standard German Text: Microsoft's Submission to SwissText 2021
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

This paper describes the winning approach in the public SwissText 2021 competition on dialect recognition and translation of Swiss German speech to standard German text. Swiss German refers to the multitude of Alemannic dialects spoken in the German-speaking parts of Switzerland. Swiss German differs significantly from standard German in pronunciation, word inventory and grammar. It is mostly incomprehensible to native German speakers. Moreover, it lacks a standardized written script. To solve the challenging task, we propose a hybrid automatic speech recognition system with a lexicon that incorporates translations, a 1st pass language model that deals with Swiss German particularities, a transfer-learned acoustic model and a strong neural language model for 2nd pass rescoring. Our submission reaches 46.04% BLEU on a blind conversational test set and outperforms the second best competitor by a 12% relative margin.

[23]  arXiv:2106.08141 [pdf, other]
Title: An adaptive Lagrange multiplier determination method for rate-distortion optimisation in hybrid video codecs
Subjects: Image and Video Processing (eess.IV)

This paper describes an adaptive Lagrange multiplier determination method for rate-quality optimisation in video compression. Inspired by the experimental results of a Lagrange multiplier selection test, the presented approach adaptively estimates the optimum Lagrange multiplier for different video content, based on distortion statistics of recently encoded frames. The proposed algorithm has been fully integrated into both the H.264 and HEVC reference codecs, and is used in rate-distortion optimisation for encoding B frames. The results show promising (up to 11% on the sequences tested) overall bitrate savings, for a minimal increase in complexity, on various types of test content based on Bjontegaard delta measurements.
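
As a minimal illustration of where a Lagrange multiplier enters rate-distortion optimisation, the sketch below assumes the standard cost J = D + lambda * R and picks the coding mode with the lowest cost. The candidate modes and their numbers are hypothetical, and the adaptive estimation of the multiplier described in the abstract is not reproduced here.

```python
def best_mode(candidates, lam):
    """Return the candidate with the smallest Lagrangian cost J = D + lam * R."""
    return min(candidates, key=lambda c: c["distortion"] + lam * c["rate"])


# Hypothetical coding modes: distortion (e.g. sum of squared errors) and rate (bits).
modes = [
    {"name": "skip", "distortion": 900.0, "rate": 2},
    {"name": "inter", "distortion": 400.0, "rate": 40},
    {"name": "intra", "distortion": 250.0, "rate": 120},
]

# A larger multiplier penalises rate more heavily, so cheaper modes win.
for lam in (1.0, 5.0, 20.0):
    print(lam, best_mode(modes, lam)["name"])
```

With these made-up numbers, the selected mode moves from the high-rate candidate to the low-rate one as the multiplier grows, which is why a content-adaptive choice of the multiplier can change the bitrate at a given quality.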

[24]  arXiv:2106.08147 [pdf, other]
Title: Perceptually-inspired super-resolution of compressed videos
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Spatial resolution adaptation is a technique which has often been employed in video compression to enhance coding efficiency. This approach encodes a lower resolution version of the input video and reconstructs the original resolution during decoding. Instead of using conventional up-sampling filters, recent work has employed advanced super-resolution methods based on convolutional neural networks (CNNs) to further improve reconstruction quality. These approaches are usually trained to minimise pixel-based losses such as Mean-Squared Error (MSE), despite the fact that this type of loss metric does not correlate well with subjective opinions. In this paper, a perceptually-inspired super-resolution approach (M-SRGAN) is proposed for spatial up-sampling of compressed video using a modified CNN model, which has been trained using a generative adversarial network (GAN) on compressed content with perceptual loss functions. The proposed method was integrated with HEVC HM 16.20, and has been evaluated on the JVET Common Test Conditions (UHD test sequences) using the Random Access configuration. The results show evident perceptual quality improvement over the original HM 16.20, with an average bitrate saving of 35.6% (Bjøntegaard Delta measurement) based on a perceptual quality metric, VMAF.

[25]  arXiv:2106.08151 [pdf, other]
Title: EuroCrops: A Pan-European Dataset for Time Series Crop Type Classification
Comments: 4 pages, website: this https URL
Journal-ref: Proc. of the 2021 conference on Big Data from Space (BiDS21), 2021, 5, 125-128
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

We present EuroCrops, a dataset based on self-declared field annotations for training and evaluating methods for crop type classification and mapping, together with its process of acquisition and harmonisation. By this, we aim to enrich the research efforts and discussion for data-driven land cover classification via Earth observation and remote sensing. Additionally, through inclusion of self-declarations gathered in the scope of subsidy control from all countries of the European Union (EU), this dataset highlights the difficulties and pitfalls one comes across when operating on a transnational level. We, therefore, also introduce a new taxonomy scheme, HCAT-ID, that aspires to capture all the aspects of reference data originating from administrative and agency databases. To address researchers from both the remote sensing and the computer vision and machine learning communities, we publish the dataset in different formats and processing levels.

[26]  arXiv:2106.08152 [pdf, other]
Title: Enhanced spatial resolution through DFT rederivations of X-ray phase retrieval algorithms
Subjects: Image and Video Processing (eess.IV); Medical Physics (physics.med-ph)

Propagation-based phase-contrast imaging, used in conjunction with the phase retrieval algorithm based on the Transport-of-Intensity Equation (TIE) (Paganin et al., 2002), is commonly used to improve the sensitivity of X-ray imaging. Recently, a 'Generalised Paganin Method' algorithm was published to correct the tendency of the TIE algorithm to over-blur images. The article, Paganin et al. 2020, provided a derivation of the new method and demonstrated a difference in the level of blurring applied by each algorithm. In this manuscript, we quantify the spatial resolution improvement and describe the optimal experimental conditions to observe this improvement. We link the effectiveness of the spatial resolution improvement to the imaging point spread function (PSF), incorporating the PSF to compare the blurring applied by each algorithm. We then validate this model through measurements of spatial resolution in experimental data imaging plastic phantoms and biological tissue, using detectors with different PSFs. By analysing edge-spread functions in CT data captured with indirect detectors with PSFs of several pixels in extent, we show negligible spatial resolution improvement when using the generalised Paganin method. However, a clear improvement in spatial resolution, up to 17%, was observed with direct detectors having PSFs of approximately one pixel in extent. Additionally, we demonstrate clear visual improvement in resolution in CT slices of rat lungs. Finally, we demonstrate the versatility of this improvement by generalising other phase retrieval algorithms, namely for multi-material samples and for spectral decomposition using propagation-based phase contrast, and experimentally verify improvements in spatial resolution.

[27]  arXiv:2106.08153 [pdf]
Title: Now You See It, Now You Dont: Adversarial Vulnerabilities in Computational Pathology
Comments: 10 pages
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Deep learning models are routinely employed in computational pathology (CPath) for solving problems of diagnostic and prognostic significance. Typically, the generalization performance of CPath models is analyzed using evaluation protocols such as cross-validation and testing on multi-centric cohorts. However, to ensure that such CPath solutions are robust and safe for use in a clinical setting, a critical analysis of their predictive performance and vulnerability to adversarial attacks is required, which is the focus of this paper. Specifically, we show that a highly accurate model for classification of tumour patches in pathology images (AUC > 0.95) can easily be attacked with minimal perturbations which are imperceptible to lay humans and trained pathologists alike. Our analytical results show that it is possible to generate single-instance white-box attacks on specific input images with high success rate and low perturbation energy. Furthermore, we have also generated a single universal perturbation matrix using the training dataset only which, when added to unseen test images, results in forcing the trained neural network to flip its prediction labels with high confidence at a success rate of > 84%. We systematically analyze the relationship between perturbation energy of an adversarial attack, its impact on morphological constructs of clinical significance, their perceptibility by a trained pathologist and saliency maps obtained using deep learning models. Based on our analysis, we strongly recommend that computational pathology models be critically analyzed using the proposed adversarial validation strategy prior to clinical adoption.

[28]  arXiv:2106.08174 [pdf]
Title: Automatic linear measurements of the fetal brain on MRI with deep neural networks
Comments: 15 pages, 8 figures, presented in CARS 2020, submitted to IJCARS
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Timely, accurate and reliable assessment of fetal brain development is essential to reduce short and long-term risks to fetus and mother. Fetal MRI is increasingly used for fetal brain assessment. Three key biometric linear measurements important for fetal brain evaluation are Cerebral Biparietal Diameter (CBD), Bone Biparietal Diameter (BBD), and Trans-Cerebellum Diameter (TCD), obtained manually by expert radiologists on reference slices, which is time consuming and prone to human error. The aim of this study was to develop a fully automatic method computing the CBD, BBD and TCD measurements from fetal brain MRI. The inputs are fetal brain MRI volumes, which may include the fetal body and the mother's abdomen. The outputs are the measurement values and the reference slices on which the measurements were computed. The method, which follows the manual measurements principle, consists of five stages: 1) computation of a Region Of Interest that includes the fetal brain with an anisotropic 3D U-Net classifier; 2) reference slice selection with a Convolutional Neural Network; 3) slice-wise fetal brain structures segmentation with a multiclass U-Net classifier; 4) computation of the fetal brain midsagittal line and fetal brain orientation; and 5) computation of the measurements. Experimental results on 214 volumes for CBD, BBD and TCD measurements yielded a mean $L_1$ difference of 1.55 mm, 1.45 mm and 1.23 mm, respectively, and a Bland-Altman 95% confidence interval ($CI_{95}$) of 3.92 mm, 3.98 mm and 2.25 mm, respectively. These results are similar to the manual inter-observer variability. The proposed automatic method for computing biometric linear measurements of the fetal brain from MR imaging achieves human level performance. It has the potential of being a useful method for the assessment of fetal brain biometry in normal and pathological cases, and of improving routine clinical practice.
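
The two agreement figures quoted above can be illustrated with a short sketch. It assumes the usual definitions: the mean absolute ($L_1$) difference between automatic and manual measurements, and Bland-Altman limits of agreement taken as the mean difference plus or minus 1.96 standard deviations. The abstract does not spell out how its $CI_{95}$ is computed, so this convention and the function name `agreement_stats` are assumptions, and the sample values are made up.

```python
import statistics


def agreement_stats(auto_mm, manual_mm):
    """Mean L1 difference, bias, and Bland-Altman 95% limits of agreement (all in mm)."""
    diffs = [a - m for a, m in zip(auto_mm, manual_mm)]
    mean_l1 = statistics.fmean([abs(d) for d in diffs])
    bias = statistics.fmean(diffs)
    # Sample standard deviation of the paired differences.
    half_width = 1.96 * statistics.stdev(diffs)
    return mean_l1, bias, (bias - half_width, bias + half_width)


# Made-up paired measurements, for illustration only.
mean_l1, bias, limits = agreement_stats([10.0, 12.0, 14.0], [11.0, 12.0, 13.0])
print(mean_l1, bias, limits)
```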

[29]  arXiv:2106.08176 [pdf, other]
Title: Automated triaging of head MRI examinations using convolutional neural networks
Comments: Accepted as an oral presentation at Medical Imaging with Deep Learning (MIDL) 2021
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

The growing demand for head magnetic resonance imaging (MRI) examinations, along with a global shortage of radiologists, has led to an increase in the time taken to report head MRI scans around the world. For many neurological conditions, this delay can result in increased morbidity and mortality. An automated triaging tool could reduce reporting times for abnormal examinations by identifying abnormalities at the time of imaging and prioritizing the reporting of these scans. In this work, we present a convolutional neural network for detecting clinically-relevant abnormalities in $\text{T}_2$-weighted head MRI scans. Using a validated neuroradiology report classifier, we generated a labelled dataset of 43,754 scans from two large UK hospitals for model training, and demonstrate accurate classification (area under the receiver operating curve (AUC) = 0.943) on a test set of 800 scans labelled by a team of neuroradiologists. Importantly, when trained on scans from only a single hospital the model generalized to scans from the other hospital ($\Delta$AUC $\leq$ 0.02). A simulation study demonstrated that our model would reduce the mean reporting time for abnormal examinations from 28 days to 14 days and from 9 days to 5 days at the two hospitals, demonstrating feasibility for use in a clinical triage environment.
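
For readers less familiar with the AUC figure reported above, here is a minimal sketch of one standard way to compute it: the fraction of (abnormal, normal) pairs in which the abnormal scan receives the higher score, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: 1 = abnormal scan, 0 = normal scan.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise form is quadratic in the number of scans, which is fine for a sketch; a rank-based computation would be the usual choice for large test sets.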

[30]  arXiv:2106.08180 [pdf, other]
Title: On the Use of HAPS to Increase Secrecy Performance in Satellite Networks
Subjects: Signal Processing (eess.SP)

In this paper, we investigate the secrecy performance of radio frequency (RF) eavesdropping for a high altitude platform station (HAPS) aided satellite communication (SatCom) system. More precisely, we propose a new SatCom scheme where a HAPS node is used as an intermediate relay to transmit the satellite's signal to the ground station (GS). In this network, free-space optical (FSO) communication is adopted between HAPS and satellite, whereas RF communication is used between HAPS and GS as the line-of-sight (LoS) communication cannot be established. To quantify the overall secrecy performance of the proposed scheme, closed-form secrecy outage probability (SOP) and the probability of positive secrecy capacity (PPSC) expressions are derived. Moreover, we investigate the effect of pointing error and shadowing severity parameters. Finally, design guidelines that can be useful in the design of practical SatCom networks are presented.

[31]  arXiv:2106.08197 [pdf, other]
Title: Physical Layer Security Framework for Optical Non-Terrestrial Networks
Subjects: Signal Processing (eess.SP)

In this work, we propose a new physical layer security framework for optical space networks. More precisely, we consider two practical eavesdropping scenarios: free-space optical (FSO) eavesdropping in the space and FSO eavesdropping in the air. In the former, we assume that a high altitude platform station (HAPS) is trying to capture the confidential information from the low earth orbit (LEO) satellite, whereas in the latter, an unmanned aerial vehicle (UAV) eavesdropper is trying to intercept the confidential information from the HAPS node. To quantify the overall performance of both scenarios, we obtain closed-form secrecy outage probability (SOP) and probability of positive secrecy capacity (PPSC) expressions and validate with Monte Carlo simulations. Furthermore, we provide important design guidelines that can be helpful in the design of secure non-terrestrial networks.
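
To make the two metrics concrete, the following Monte Carlo sketch estimates the secrecy outage probability and the probability of positive secrecy capacity from the standard definition C_s = max(0, log2(1 + snr_main) - log2(1 + snr_eve)). The exponentially distributed SNRs are a placeholder assumption chosen only because they are easy to sample; they are not the channel models analysed in the paper, and all parameter values are hypothetical.

```python
import math
import random


def simulate_secrecy(mean_snr_main, mean_snr_eve, target_rate, trials=100_000, seed=0):
    """Estimate (SOP, PPSC) by sampling the legitimate and eavesdropper SNRs."""
    rng = random.Random(seed)
    outages = 0
    positive = 0
    for _ in range(trials):
        # Placeholder fading model: exponentially distributed instantaneous SNR.
        snr_main = rng.expovariate(1.0 / mean_snr_main)
        snr_eve = rng.expovariate(1.0 / mean_snr_eve)
        capacity = max(0.0, math.log2(1.0 + snr_main) - math.log2(1.0 + snr_eve))
        outages += int(capacity < target_rate)  # outage: capacity below target rate
        positive += int(capacity > 0.0)         # secrecy capacity strictly positive
    return outages / trials, positive / trials


sop, ppsc = simulate_secrecy(mean_snr_main=10.0, mean_snr_eve=1.0, target_rate=1.0)
print(sop, ppsc)
```

For this toy model the PPSC has the closed form mean_snr_main / (mean_snr_main + mean_snr_eve), so the simulated value can be checked against it in the same spirit as the validation described in the abstract.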

[32]  arXiv:2106.08201 [pdf]
Title: How to Determine an Optimal Noise Subspace?
Authors: Kaijie Xu
Subjects: Signal Processing (eess.SP); Systems and Control (eess.SY)

The Multiple Signal Classification (MUSIC) algorithm, based on the orthogonality between the signal subspace and the noise subspace, is one of the most frequently used methods for Direction Of Arrival (DOA) estimation, and its DOA estimation performance mainly depends on the accuracy of the noise subspace. In most existing research, the noise subspace is formed by (defined as) the eigenvectors corresponding to all small eigenvalues of the array output covariance matrix. However, we found that the estimation of DOA through the noise subspace in the traditional formation is not optimal in almost all cases, and using a partial noise subspace can always obtain optimal estimation results. In other words, the subspace spanned by the eigenvectors corresponding to a part of the small eigenvalues is more representative of the noise subspace. We demonstrate this conclusion through a number of experiments. Thus, which and how many eigenvectors should be selected to form the partial noise subspace appears to be an interesting issue. In addition, this research poses a much more general problem: how to select eigenvectors to determine an optimal noise subspace?

[33]  arXiv:2106.08211 [pdf, other]
Title: E2E-based Multi-task Learning Approach to Joint Speech and Accent Recognition
Subjects: Audio and Speech Processing (eess.AS)

In this paper, we propose a single multi-task learning framework to perform End-to-End (E2E) speech recognition (ASR) and accent recognition (AR) simultaneously. The proposed framework is not only more compact but can also yield comparable or even better results than standalone systems. Specifically, we found that the overall performance is predominantly determined by the ASR task, and the E2E-based ASR pretraining is essential to achieve improved performance, particularly for the AR task. Additionally, we conduct several analyses of the proposed method. First, though the objective loss for the AR task is much smaller than its counterpart for the ASR task, a smaller weighting factor for the AR task in the joint objective function is necessary to yield better results for each task. Second, we found that sharing only a few layers of the encoder yields better AR results than sharing the entire encoder. Experimentally, the proposed method produces WER results close to the best standalone E2E ASR ones, while it achieves 7.7% and 4.2% relative improvement over standalone and single-task-based joint recognition methods, respectively, on the test set for accent recognition.

[34]  arXiv:2106.08274 [pdf, other]
Title: Elasticity Based Demand Forecasting and Price Optimization for Online Retail
Subjects: Systems and Control (eess.SY); Computational Engineering, Finance, and Science (cs.CE); Optimization and Control (math.OC)

We study the problem of an online retailer who observes the unit sales of a product and dynamically changes the retail price in order to maximize the expected revenue. Assuming the demand for the product is price sensitive, we are interested in the optimal pricing policy when future demand is uncertain. We build a system to investigate the relationship between retail price and demand and to estimate the demand function. The system predicts demand and revenue at a given retail price. We formulate a revenue maximization problem over a discrete finite time horizon with discrete retail prices. The optimal pricing policy is solved based on the predicted demand and revenue values. With computational experiments, we investigate the effect of the optimal pricing policy on inventory management.
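
A revenue maximization problem over a discrete finite horizon with a discrete price set can be sketched as a small dynamic program. The version below is only an illustration under stated assumptions: demand is deterministic and follows a hypothetical constant-elasticity model d(p) = scale * p^(-elasticity) rounded down to whole units, and sales are capped by a finite stock. None of these modelling choices, names, or numbers come from the paper.

```python
from functools import lru_cache


def optimal_pricing(prices, horizon, inventory, scale=100.0, elasticity=1.5):
    """Return (best total revenue, price plan) for a finite horizon and stock."""

    def demand(price):
        # Hypothetical constant-elasticity demand, rounded down to whole units.
        return int(scale * price ** (-elasticity))

    @lru_cache(maxsize=None)
    def value(t, stock):
        # No revenue is possible once the horizon ends or the stock runs out.
        if t == horizon or stock == 0:
            return 0.0, ()
        best = (-1.0, ())
        for p in prices:
            sold = min(demand(p), stock)
            future, plan = value(t + 1, stock - sold)
            total = p * sold + future
            if total > best[0]:
                best = (total, (p,) + plan)
        return best

    return value(0, inventory)


revenue, plan = optimal_pricing(prices=(2.0, 4.0), horizon=2, inventory=40)
print(revenue, plan)
```

With uncertain demand, the same recursion would take an expectation over demand outcomes at each step; the deterministic form is kept here to keep the sketch short.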

[35]  arXiv:2106.08313 [pdf, other]
Title: A Study into Pre-training Strategies for Spoken Language Understanding on Dysarthric Speech
Comments: Accepted by Interspeech 2021
Subjects: Audio and Speech Processing (eess.AS)

End-to-end (E2E) spoken language understanding (SLU) systems avoid an intermediate textual representation by mapping speech directly into intents with slot values. This approach requires considerable domain-specific training data. In low-resource scenarios this is a major concern, e.g., in the present study dealing with SLU for dysarthric speech. Pretraining part of the SLU model for automatic speech recognition targets helps, but no research has shown to what extent SLU on dysarthric speech benefits from knowledge transferred from other dysarthric speech tasks. This paper investigates the efficiency of pre-training strategies for SLU tasks on dysarthric speech. The designed SLU system consists of a TDNN acoustic model for feature encoding and a capsule network for intent and slot decoding. The acoustic model is pre-trained in two stages: initialization with a corpus of normal speech and finetuning on a mixture of dysarthric and normal speech. By introducing the intelligibility score as a metric of the impairment severity, this paper quantitatively analyzes the relation between generalization and pathology severity for dysarthric speech.

[36]  arXiv:2106.08321 [pdf, other]
Title: ADEPT: A Dataset for Evaluating Prosody Transfer
Comments: 5 pages, 1 figure, accepted to Interspeech 2021
Subjects: Audio and Speech Processing (eess.AS)

Text-to-speech is now able to achieve near-human naturalness and research focus has shifted to increasing expressivity. One popular method is to transfer the prosody from a reference speech sample. There have been considerable advances in using prosody transfer to generate more expressive speech, but the field lacks a clear definition of what successful prosody transfer means and a method for measuring it.
We introduce a dataset of prosodically-varied reference natural speech samples for evaluating prosody transfer. The samples include global variations reflecting emotion and interpersonal attitude, and local variations reflecting topical emphasis, propositional attitude, syntactic phrasing and marked tonicity. The corpus only includes prosodic variations that listeners are able to distinguish with reasonable accuracy, and we report these figures as a benchmark against which text-to-speech prosody transfer can be compared.
We conclude the paper with a demonstration of our proposed evaluation methodology, using the corpus to evaluate two text-to-speech models that perform prosody transfer.

[37]  arXiv:2106.08325 [pdf]
Title: Fuel-Economical Distributed Model Predictive Control for Heavy-Duty Truck Platoon
Comments: accepted for publication at the 24th IEEE Intelligent Transportation Systems Conference (ITSC 2021)
Subjects: Systems and Control (eess.SY)

This paper proposes a fuel-economical distributed model predictive control design (Eco-DMPC) for a homogeneous heavy-duty truck platoon. The proposed control strategy integrates a fuel-optimal control strategy for the leader truck with a distributed formation control for the following trucks in the heavy-duty truck platoon. The fuel-optimal control strategy is implemented by a nonlinear model predictive control (NMPC) design with an instantaneous fuel consumption model. The proposed fuel-optimal control strategy utilizes the preview information of the preceding traffic to achieve fuel-economical speed planning by avoiding energy-inefficient maneuvers, particularly under transient traffic conditions. The distributed formation control is designed with a serial distributed model predictive control (DMPC) strategy with guaranteed local and string stability. In the DMPC strategy, each following truck acquires the future predicted state information of its predecessor through vehicle connectivity and then applies local optimal control to maintain constant spacing. Simulation studies are conducted to investigate the fuel economy performance of the proposed control strategy and to validate the local and string stability of the platoon under a realistic traffic scenario. Compared with a human-operated platoon and a benchmark formation-controlled platoon, the proposed Eco-DMPC significantly improves the fuel economy and road utilization.

Cross-lists for Wed, 16 Jun 21

[38]  arXiv:2106.07577 (cross-list from cs.SD) [pdf, other]
Title: F-T-LSTM based Complex Network for Joint Acoustic Echo Cancellation and Speech Enhancement
Comments: Accepted by Interspeech 2021
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

With the increasing demand for audio communication and online conferencing, ensuring the robustness of Acoustic Echo Cancellation (AEC) under complicated acoustic scenarios including noise, reverberation and nonlinear distortion has become a top issue. Although some traditional methods consider nonlinear distortion, they are still inefficient for echo suppression, and their performance degrades when noise is present. In this paper, we present a real-time AEC approach using a complex neural network to better model the important phase information, together with frequency-time-LSTMs (F-T-LSTM), which scan both the frequency and time axes, for better temporal modeling. Moreover, we utilize a modified SI-SNR as the cost function to give the model better echo cancellation and noise suppression (NS) performance. With only 1.4M parameters, the proposed approach outperforms the AEC-challenge baseline by 0.27 in terms of Mean Opinion Score (MOS).
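
As background for the cost function mentioned above, here is a sketch of the standard scale-invariant signal-to-noise ratio (SI-SNR): the estimate is projected onto the clean reference, and the energy of that projection is compared with the energy of the residual. This is the commonly used definition, not the modified variant used in the paper, whose details the abstract does not give; the function name and the toy signals are assumptions.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Standard scale-invariant SNR in dB between an estimate and a clean target."""

    def zero_mean(x):
        mean = sum(x) / len(x)
        return [v - mean for v in x]

    est = zero_mean(estimate)
    ref = zero_mean(target)
    # Project the estimate onto the reference to get the target component.
    scale = sum(e * r for e, r in zip(est, ref)) / (sum(r * r for r in ref) + eps)
    projection = [scale * r for r in ref]
    residual = [e - p for e, p in zip(est, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(n * n for n in residual) + eps
    return 10.0 * math.log10(signal_energy / noise_energy + eps)


clean = [1.0, -1.0, 1.0, -1.0]
noisy = [1.1, -0.9, 1.2, -1.0]
print(si_snr(noisy, clean))
```

When used as a training objective, the negative of this value is typically minimised, so that a higher SI-SNR corresponds to a lower loss.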

[39]  arXiv:2106.07699 (cross-list from cs.CL) [pdf, ps, other]
Title: Using heterogeneity in semi-supervised transcription hypotheses to improve code-switched speech recognition
Comments: 5 pages
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Modeling code-switched speech is an important problem in automatic speech recognition (ASR). Labeled code-switched data are rare, so monolingual data are often used to model code-switched speech. These monolingual data may be more closely matched to one of the languages in the code-switch pair. We show that such asymmetry can bias prediction toward the better-matched language and degrade overall model performance. To address this issue, we propose a semi-supervised approach for code-switched ASR. We consider the case of English-Mandarin code-switching, and the problem of using monolingual data to build bilingual "transcription models" for annotation of unlabeled code-switched data. We first build multiple transcription models so that their individual predictions are variously biased toward either English or Mandarin. We then combine these biased transcriptions using confidence-based selection. This strategy generates a superior transcript for semi-supervised training, and obtains a 19% relative improvement compared to a semi-supervised system that relies on a transcription model built with only the best-matched monolingual data.

[40]  arXiv:2106.07708 (cross-list from cs.LG) [pdf]
Title: CathAI: Fully Automated Interpretation of Coronary Angiograms Using Neural Networks
Comments: 62 pages, 3 main figures, 2 main tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Coronary heart disease (CHD) is the leading cause of adult death in the United States and worldwide, and the coronary angiography procedure is the primary gateway for its diagnosis and clinical management decisions. The standard-of-care for interpretation of coronary angiograms depends upon ad-hoc visual assessment by the physician operator. However, ad-hoc visual interpretation of angiograms is poorly reproducible, highly variable and bias prone. Here we show for the first time that fully-automated angiogram interpretation to estimate coronary artery stenosis is possible using a sequence of deep neural network algorithms. The algorithmic pipeline we developed--called CathAI--achieves state-of-the-art performance across the sequence of tasks required to accomplish automated interpretation of unselected, real-world angiograms. CathAI (Algorithms 1-2) demonstrated positive predictive value, sensitivity and F1 score of >=90% to identify the projection angle overall and >=93% for left or right coronary artery angiogram detection, the primary anatomic structures of interest. To predict obstructive coronary artery stenosis (>=70% stenosis), CathAI (Algorithm 4) exhibited an area under the receiver operating characteristic curve (AUC) of 0.862 (95% CI: 0.843-0.880). When externally validated in a healthcare system in another country, CathAI AUC was 0.869 (95% CI: 0.830-0.907) to predict obstructive coronary artery stenosis. Our results demonstrate that multiple purpose-built neural networks can function in sequence to accomplish the complex series of tasks required for automated analysis of real-world angiograms. Deployment of CathAI may serve to increase standardization and reproducibility in coronary stenosis assessment, while providing a robust foundation to accomplish future tasks for algorithmic angiographic interpretation.

[41]  arXiv:2106.07716 (cross-list from cs.CL) [pdf, ps, other]
Title: Overcoming Domain Mismatch in Low Resource Sequence-to-Sequence ASR Models using Hybrid Generated Pseudotranscripts
Comments: 5 pages
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Sequence-to-sequence (seq2seq) models are competitive with hybrid models for automatic speech recognition (ASR) tasks when large amounts of training data are available. However, data sparsity and domain adaptation are more problematic for seq2seq models than their hybrid counterparts. We examine corpora of five languages from the IARPA MATERIAL program where the transcribed data is conversational telephone speech (CTS) and evaluation data is broadcast news (BN). We show that there is a sizable initial gap in such a data condition between hybrid and seq2seq models, and the hybrid model is able to further improve through the use of additional language model (LM) data. We use an additional set of untranscribed data primarily in the BN domain for semisupervised training. In semisupervised training, a seed model trained on transcribed data generates hypothesized transcripts for unlabeled domain-matched data for further training. By using a hybrid model with an expanded language model for pseudotranscription, we are able to improve our seq2seq model from an average word error rate (WER) of 66.7% across all five languages to 29.0% WER. While this puts the seq2seq model at a competitive operating point, hybrid models are still able to use additional LM data to maintain an advantage.
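
For reference, word error rate is conventionally the word-level edit distance between hypothesis and reference divided by the reference length, and the improvement quoted above (66.7% to 29.0% average WER) corresponds to roughly a 56.5% relative reduction. The sketch below shows both computations; the function names and example sentences are illustrative and not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline


print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
print(relative_reduction(66.7, 29.0))
```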

[42]  arXiv:2106.07732 (cross-list from cs.SD) [pdf, other]
Title: Learning Audio-Visual Dereverberation
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a human speaker reveals important cues about the room geometry, materials, and speaker location, all of which influence the precise reverberation effects in the audio stream. We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene. In support of this new task, we develop a large-scale dataset that uses realistic acoustic renderings of speech in real-world 3D scans of homes offering a variety of room acoustics. Demonstrating our approach on both simulated and real imagery for speech enhancement, speech recognition, and speaker identification, we show it achieves state-of-the-art performance and substantially improves over traditional audio-only methods. Project page: this http URL

[43]  arXiv:2106.07734 (cross-list from cs.CL) [pdf, other]
Title: CoDERT: Distilling Encoder Representations with Co-learning for Transducer-based Speech Recognition
Comments: Accepted at InterSpeech 2021
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

We propose a simple yet effective method to compress an RNN-Transducer (RNN-T) through the well-known knowledge distillation paradigm. We show that the transducer's encoder outputs naturally have a high entropy and contain rich information about acoustically similar word-piece confusions. This rich information is suppressed when combined with the lower entropy decoder outputs to produce the joint network logits. Consequently, we introduce an auxiliary loss to distill the encoder logits from a teacher transducer's encoder, and explore training strategies where this encoder distillation works effectively. We find that tandem training of teacher and student encoders with an inplace encoder distillation outperforms the use of a pre-trained and static teacher transducer. We also report an interesting phenomenon we refer to as implicit distillation, that occurs when the teacher and student encoders share the same decoder. Our experiments show 5.37-8.4% relative word error rate reductions (WERR) on in-house test sets, and 5.05-6.18% relative WERRs on LibriSpeech test sets.
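
A common way to distill one set of logits into another is a KL-divergence loss between the teacher and student output distributions. The sketch below shows that generic form for a single frame, assuming a temperature-scaled softmax. The abstract does not state the exact auxiliary loss used for the encoder logits, so the KL form, the temperature parameter, and the function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between softened output distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical word-piece logits for one frame.
teacher = [2.0, 1.5, 0.1]
student = [1.0, 1.0, 1.0]
print(distillation_loss(teacher, student))
```

In training, a term of this kind would typically be averaged over frames and added to the main transducer loss with a weighting factor.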

[44]  arXiv:2106.07736 (cross-list from math.OC) [pdf, ps, other]
Title: Unique sparse decomposition of low rank matrices
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Signal Processing (eess.SP); Numerical Analysis (math.NA)

The problem of finding the unique low-dimensional decomposition of a given matrix has been a fundamental and recurrent problem in many areas. In this paper, we study the problem of seeking a unique decomposition of a low-rank matrix $Y\in \mathbb{R}^{p\times n}$ that admits a sparse representation. Specifically, we consider $Y = A X\in \mathbb{R}^{p\times n}$ where the matrix $A\in \mathbb{R}^{p\times r}$ has full column rank, with $r < \min\{n,p\}$, and the matrix $X\in \mathbb{R}^{r\times n}$ is element-wise sparse. We prove that this sparse decomposition of $Y$ can be uniquely identified, up to some intrinsic signed permutation. Our approach relies on solving a nonconvex optimization problem constrained over the unit sphere. Our geometric analysis of the nonconvex optimization landscape shows that any strict local solution is close to the ground truth solution, and can be recovered by a simple data-driven initialization followed by any second-order descent algorithm. Finally, we corroborate these theoretical results with numerical experiments.

[45]  arXiv:2106.07787 (cross-list from cs.SD) [pdf, other]
Title: Tracing Back Music Emotion Predictions to Sound Sources and Intuitive Perceptual Qualities
Comments: Sound and Music Computing Conference 2021
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Music emotion recognition is an important task in MIR (Music Information Retrieval) research. Owing to factors like the subjective nature of the task and the variation of emotional cues between musical genres, there are still significant challenges in developing reliable and generalizable models. One important step towards better models would be to understand what a model is actually learning from the data and how the prediction for a particular input is made. In previous work, we have shown how to derive explanations of model predictions in terms of spectrogram image segments that connect to the high-level emotion prediction via a layer of easily interpretable perceptual features. However, that scheme lacks intuitive musical comprehensibility at the spectrogram level. In the present work, we bridge this gap by merging audioLIME -- a source-separation based explainer -- with mid-level perceptual features, thus forming an intuitive connection chain between the input audio and the output emotion predictions. We demonstrate the usefulness of this method by applying it to debug a biased emotion prediction model.

[46]  arXiv:2106.07803 (cross-list from cs.LG) [pdf, other]
Title: SynthASR: Unlocking Synthetic Data for Speech Recognition
Comments: Accepted to Interspeech 2021
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

End-to-end (E2E) automatic speech recognition (ASR) models have recently demonstrated superior performance over the traditional hybrid ASR models. Training an E2E ASR model requires a large amount of data, which is not only expensive but may also increase dependency on production data. At the same time, synthetic speech generated by state-of-the-art text-to-speech (TTS) engines has advanced to near-human naturalness. In this work, we propose to utilize synthetic speech for ASR training (SynthASR) in applications where data is sparse or hard to get for ASR model training. In addition, we apply continual learning with a novel multi-stage training strategy to address catastrophic forgetting, achieved by a mix of weighted multi-style training, data augmentation, encoder freezing, and parameter regularization. In our experiments conducted on in-house datasets for a new application of recognizing medication names, training ASR RNN-T models with synthetic audio via the proposed multi-stage training improved the recognition performance on the new application by more than 65% relative, without degradation on existing general applications. Our observations show that SynthASR holds great promise in training state-of-the-art large-scale E2E ASR models for new applications while reducing the costs and dependency on production data.

[47]  arXiv:2106.07843 (cross-list from cs.SD) [pdf, other]
Title: Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation
Comments: Accepted to Interspeech 2021
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

In this paper, we introduce a novel semi-supervised learning framework for end-to-end speech separation. The proposed method first uses mixtures of unseparated sources and the mixture invariant training (MixIT) criterion to train a teacher model. The teacher model then estimates separated sources that are used to train a student model with standard permutation invariant training (PIT). The student model can be fine-tuned with supervised data, i.e., paired artificial mixtures and clean speech sources, and further improved via model distillation. Experiments with single and multi channel mixtures show that the teacher-student training resolves the over-separation problem observed in the original MixIT method. Further, the semisupervised performance is comparable to a fully-supervised separation system trained using ten times the amount of supervised data.
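
For readers unfamiliar with permutation invariant training (PIT), the following minimal sketch shows the core idea: the loss is evaluated under every assignment of estimated sources to reference sources and the smallest value is kept. Mean squared error is used here purely for simplicity; the loss function, the helper names, and the toy signals are assumptions and are not taken from the paper.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, references):
    """Return (best_loss, best_perm).

    best_perm[i] is the index of the estimate matched to references[i].
    """
    n = len(references)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        loss = sum(mse(estimates[perm[i]], references[i]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy usage: the model outputs the two sources in swapped order.
references = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
estimates = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
loss, perm = pit_loss(estimates, references)
```

The exhaustive search over permutations grows factorially with the number of sources, which is acceptable for the two or three sources typical in speech separation.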

[48]  arXiv:2106.07856 (cross-list from cs.CV) [pdf, other]
Title: A Hybrid mmWave and Camera System for Long-Range Depth Imaging
Subjects: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI); Robotics (cs.RO); Signal Processing (eess.SP)

mmWave radars offer excellent depth resolution owing to their high bandwidth at mmWave radio frequencies. Yet, they suffer intrinsically from poor angular resolution, which is an order of magnitude worse than that of camera systems, and are therefore not a capable 3-D imaging solution in isolation. We propose Metamoran, a system that combines the complementary strengths of radar and camera systems to obtain depth images at high azimuthal resolutions at distances of several tens of meters with high accuracy, all from a single fixed vantage point. Metamoran enables rich long-range depth imaging outdoors with applications to roadside safety infrastructure, surveillance and wide-area mapping. Our key insight is to use the high azimuth resolution from cameras using computer vision techniques, including image segmentation and monocular depth estimation, to obtain object shapes and use these as priors for our novel specular beamforming algorithm. We also design this algorithm to work in cluttered environments with weak reflections and in partially occluded scenarios. We perform a detailed evaluation of Metamoran's depth imaging and sensing capabilities in 200 diverse scenes in a major U.S. city. Our evaluation shows that Metamoran estimates the depth of an object up to 60 m away with a median error of 28 cm, an improvement of 13× compared to a naive radar+camera baseline and 23× compared to monocular depth estimation.

[49]  arXiv:2106.07868 (cross-list from cs.LG) [pdf, other]
Title: Voting for the right answer: Adversarial defense for speaker verification
Comments: Accepted by Interspeech 2021. Code is available at this https URL
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Automatic speaker verification (ASV) is a well-developed technology for biometric identification, and has been ubiquitously implemented in security-critical applications, such as banking and access control. However, previous works have shown that ASV is vulnerable to adversarial attacks: adversarial samples are very similar to their original counterparts in human perception, yet manipulate the ASV system into rendering wrong predictions. Due to the very late emergence of adversarial attacks for ASV, effective countermeasures against them are limited. Given that the security of ASV is of high priority, in this work, we propose the idea of "voting for the right answer" to prevent risky decisions of ASV in blind spot areas, by employing random sampling and voting. Experimental results show that our proposed method improves the robustness against both the limited-knowledge attackers, by pulling the adversarial samples out of the blind spots, and the perfect-knowledge attackers, by introducing randomness and increasing the attackers' budgets. The code for reproducing main results is available at https://github.com/thuhcsi/adsv_voting.
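
The idea of random sampling followed by voting can be sketched as below. This is only an illustration under stated assumptions: the abstract does not specify the sampling distribution, so Gaussian perturbation is assumed, and `cosine_score`, `vote_accept`, and all parameter values are hypothetical stand-ins rather than the authors' implementation (see their repository for that).

```python
import math
import random


def cosine_score(a, b):
    """Toy stand-in for an ASV scoring function (hypothetical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def vote_accept(score_fn, enrol, test, threshold, n_samples=11, sigma=0.01, seed=0):
    """Accept only if a majority of randomly perturbed copies of `test` pass.

    Gaussian noise with standard deviation `sigma` is an assumed sampling scheme.
    """
    rng = random.Random(seed)
    accepts = 0
    for _ in range(n_samples):
        noisy = [x + rng.gauss(0.0, sigma) for x in test]
        if score_fn(enrol, noisy) >= threshold:
            accepts += 1
    return accepts * 2 > n_samples


# Toy usage with 3-dimensional "embeddings" (made-up values).
enrol = [1.0, 0.0, 0.0]
genuine = vote_accept(cosine_score, enrol, [1.0, 0.0, 0.0], threshold=0.5)
impostor = vote_accept(cosine_score, enrol, [0.0, 1.0, 0.0], threshold=0.5)
```

Using an odd `n_samples` avoids ties in the majority vote.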

[50]  arXiv:2106.07874 (cross-list from cs.SD) [pdf]
Title: Towards the Objective Speech Assessment of Smoking Status based on Voice Features: A Review of the Literature
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

In smoking cessation clinical research and practice, objective validation of self-reported smoking status is crucial for ensuring the reliability of the primary outcome, that is, smoking abstinence. Speech signals convey important information about a speaker, such as age, gender, body size, emotional state, and health state. We investigated (1) if smoking could measurably alter voice features, (2) if smoking cessation could lead to changes in voice, and therefore (3) if the voice-based smoking status assessment has the potential to be used as an objective smoking cessation validation method.

[51]  arXiv:2106.07922 (cross-list from cs.CL) [pdf, other]
Title: An Automated Quality Evaluation Framework of Psychotherapy Conversations with Local Quality Estimates
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Computational approaches for assessing the quality of conversation-based psychotherapy, such as Cognitive Behavioral Therapy (CBT) and Motivational Interviewing (MI), have been developed recently to support quality assurance and clinical training. However, due to the long session lengths and limited modeling resources, computational methods largely rely on frequency-based lexical features or distribution of dialogue acts. In this work, we propose a hierarchical framework to automatically evaluate the quality of a CBT interaction. We divide each psychotherapy session into conversation segments and input those into a BERT-based model to produce segment embeddings. We first fine-tune BERT for predicting segment-level (local) quality scores and then use segment embeddings as lower-level input to a Bidirectional LSTM-based neural network to predict session-level (global) quality estimates. In particular, the segment-level quality scores are initialized with the session-level scores and we model the global quality as a function of the local quality scores to achieve accurate segment-level quality estimates. These estimated segment-level scores benefit the BERT fine-tuning and help in learning better segment embeddings. We evaluate the proposed framework on data drawn from real-world CBT clinical session recordings to predict multiple session-level behavior codes. The results indicate that our approach leads to improved evaluation accuracy for most codes in both regression and classification tasks.

[52]  arXiv:2106.07938 (cross-list from cs.IT) [pdf, ps, other]
Title: User Pairing and Power Allocation for IRS-Assisted NOMA Systems with Imperfect Phase Compensation
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

In this letter, we analyze the performance of the intelligent reflecting surface (IRS) assisted downlink non-orthogonal multiple access (NOMA) systems in the presence of imperfect phase compensation. We derive an upper bound on the imperfect phase compensation to achieve minimum required data rates for each user. Using this bound, we propose an adaptive user pairing algorithm to maximize the network throughput. We then derive bounds on the power allocation factors and propose power allocation algorithms for the paired users to achieve the maximum sum rate or ensure fairness. Through extensive simulations, we show that the proposed algorithms significantly outperform the state-of-the-art algorithms.

[53]  arXiv:2106.07976 (cross-list from cs.LG) [pdf, other]
Title: Federated Learning for Internet of Things: A Federated Learning Framework for On-device Anomaly Data Detection
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)

Federated learning can be a promising solution for enabling IoT cybersecurity (i.e., anomaly detection in the IoT environment) while preserving data privacy and mitigating the high communication/storage overhead (e.g., high-frequency data from time-series sensors) of centralized over-the-cloud approaches. In this paper, to further push forward this direction with a comprehensive study in both algorithm and system design, we build the FedIoT platform, which contains a synthesized dataset using N-BaIoT, the FedDetect algorithm, and a system design for IoT devices. Furthermore, the proposed FedDetect learning framework improves the performance by utilizing an adaptive optimizer (e.g., Adam) and a cross-round learning rate scheduler. In a network of realistic IoT devices (Raspberry Pi), we evaluate the FedIoT platform and the FedDetect algorithm in both model and system performance. Our results demonstrate the efficacy of federated learning in detecting a large range of attack types. The system efficiency analysis indicates that both end-to-end training time and memory cost are affordable and promising for resource-constrained IoT devices. The source code is publicly available.

[54]  arXiv:2106.07978 (cross-list from physics.med-ph) [pdf, other]
Title: Pixel-reassignment in Ultrasound Imaging
Subjects: Medical Physics (physics.med-ph); Image and Video Processing (eess.IV)

We present an adaptation of the pixel-reassignment technique from confocal fluorescent microscopy to coherent ultrasound imaging. The method, Ultrasound Pixel-Reassignment (UPR), provides a resolution and signal-to-noise ratio (SNR) improvement in ultrasound imaging by computationally reassigning off-focus signals acquired using traditional plane-wave compounding ultrasonography. We theoretically analyze the analogy between the optical and ultrasound implementations of pixel reassignment, and experimentally evaluate the imaging quality on tissue-mimicking acoustic phantoms. We demonstrate that UPR provides a 25% resolution improvement and a 3 dB SNR improvement in in-vitro scans, without any change in hardware or acquisition scheme.

[55]  arXiv:2106.08004 (cross-list from cs.SD) [pdf, other]
Title: Adaptive Margin Circle Loss for Speaker Verification
Authors: Runqiu Xiao
Comments: Accepted by Interspeech 2021
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Deep-Neural-Network (DNN) based speaker verification systems use the angular softmax loss with margin penalties to enhance the intra-class compactness of speaker embeddings, which achieved remarkable performance. In this paper, we propose a novel angular loss function called adaptive margin circle loss for speaker verification. The stage-based margin and chunk-based margin are applied to improve the angular discrimination of circle loss on the training set. The analysis on gradients shows that, compared with the previous angular loss like Additive Margin Softmax (Am-Softmax), circle loss has flexible optimization and definite convergence status. Experiments are carried out on the Voxceleb and SITW. By applying adaptive margin circle loss, our best system achieves 1.31% EER on Voxceleb1 and 2.13% on SITW core-core.

[56]  arXiv:2106.08011 (cross-list from cs.IT) [pdf, other]
Title: Over-the-Air Decentralized Federated Learning
Comments: Accepted by ISIT 2021
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)

In this paper, we consider decentralized federated learning (FL) over wireless networks, where over-the-air computation (AirComp) is adopted to facilitate the local model consensus in a device-to-device (D2D) communication manner. However, the AirComp-based consensus phase introduces additive noise in each algorithm iterate, and the consensus needs to be robust to wireless network topology changes; together these introduce a coupled and novel challenge of establishing the convergence of wireless decentralized FL algorithms. To facilitate the consensus phase, we propose an AirComp-based DSGD with gradient tracking and variance reduction (DSGT-VR) algorithm, where both precoding and decoding strategies are developed for D2D communication. Furthermore, we prove that the proposed algorithm converges linearly and establish the optimality gap for strongly convex and smooth loss functions, taking into account the channel fading and noise. The theoretical result shows that the additional error bound in the optimality gap depends on the number of devices. Extensive simulations verify the theoretical results and show that the proposed algorithm outperforms other benchmark decentralized FL algorithms over wireless networks.

[57]  arXiv:2106.08088 (cross-list from cs.IT) [pdf, other]
Title: Heterogeneous Multi-sensor Fusion with Random Finite Set Multi-object Densities
Authors: Wei Yi, Lei Chai
Subjects: Information Theory (cs.IT); Systems and Control (eess.SY)

This paper addresses density-based multi-sensor cooperative fusion using random finite set (RFS) type multi-object densities (MODs). Existing fusion methods use scalar weights to characterize the relative information confidence among the local MODs, and in this way the portion of contribution of each local MOD to the fused global MOD can be tuned via adjusting these weights. Our analysis shows that the fusion mechanism of using a scalar coefficient can be oversimplified for practical scenarios, as the information confidence of an MOD is complex and usually space-varying due to the imperfection of sensor ability and the various impacts from the surveillance environment. Consequently, severe fusion performance degradation can be observed when these scalar weights fail to reflect the actual situation. We make two contributions towards addressing this problem. Firstly, we propose a novel heterogeneous fusion method to perform the information averaging among local RFS MODs. By factorizing each local MOD into a number of smaller sub-MODs, it can transform the original complicated fusion problem into a much easier parallelizable multi-cluster fusion problem. Secondly, as the proposed fusion strategy is a general procedure without any particular model assumptions, we further derive the detailed heterogeneous fusion equations, with centralized network architecture, for both the probability hypothesis density (PHD) filter and the multi-Bernoulli (MB) filter. The Gaussian mixture implementations of the proposed fusion algorithms are also presented. Various numerical experiments are designed to demonstrate the efficacy of the proposed fusion methods.

[58]  arXiv:2106.08104 (cross-list from cs.MM) [pdf, other]
Title: Detect and remove watermark in deep neural networks via generative adversarial networks
Subjects: Multimedia (cs.MM); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Deep neural networks (DNN) have achieved remarkable performance in various fields. However, training a DNN model from scratch requires a lot of computing resources and training data. It is difficult for most individual users to obtain such computing resources and training data. Model copyright infringement has become an emerging problem in recent years. For instance, pre-trained models may be stolen or abused by unauthorized users without the permission of the model owner. Recently, many works on protecting the intellectual property of DNN models have been proposed. In these works, embedding watermarks into a DNN based on a backdoor is one of the widely used methods. However, when the DNN model is stolen, the backdoor-based watermark may face the risk of being detected and removed by an adversary. In this paper, we propose a scheme to detect and remove watermarks in deep neural networks via generative adversarial networks (GAN). We demonstrate that the backdoor-based DNN watermarks are vulnerable to the proposed GAN-based watermark removal attack. The proposed attack method includes two phases. In the first phase, we use the GAN and a few clean images to detect and reverse the watermark in the DNN model. In the second phase, we fine-tune the watermarked DNN based on the reversed backdoor images. Experimental evaluations on the MNIST and CIFAR10 datasets demonstrate that the proposed method can effectively remove about 98% of the watermark in DNN models, as the watermark retention rate reduces from 100% to less than 2% after applying the proposed attack. Meanwhile, the proposed attack hardly affects the model's performance. The test accuracy of the watermarked DNN on the MNIST and the CIFAR10 datasets drops by less than 1% and 3%, respectively.

[59]  arXiv:2106.08164 (cross-list from cs.RO) [pdf]
Title: Task Allocation and Coordinated Motion Planning for Autonomous Multi-Robot Optical Inspection Systems
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)

Autonomous multi-robot optical inspection systems are increasingly applied for obtaining inline measurements in process monitoring and quality control. Numerous methods for path planning and robotic coordination have been developed for static and dynamic environments and applied to different fields. However, these approaches may not work for the autonomous multi-robot optical inspection system due to fast computation requirements of inline optimization, unique characteristics on robotic end-effector orientations, and complex large-scale free-form product surfaces. This paper proposes a novel task allocation methodology for coordinated motion planning of multi-robot inspection. Specifically, (1) a local robust inspection task allocation is proposed to achieve efficient and well-balanced measurement assignment among robots; (2) collision-free path planning and coordinated motion planning are developed via dynamic searching in robotic coordinate space and perturbation of probe poses or local paths in the conflicting robots. A case study shows that the proposed approach can mitigate the risk of collisions between robots and environments, resolve conflicts among robots, and reduce the inspection cycle time significantly and consistently.

[60]  arXiv:2106.08165 (cross-list from cs.IT) [pdf, ps, other]
Title: QoE Driven VR 360 Video Massive MIMO Transmission
Comments: Accepted by IEEE Transactions on Wireless Communications
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

Massive multiple-input and multiple-output (MIMO) enables ultra-high throughput and low latency for tile-based adaptive virtual reality (VR) 360 video transmission in wireless networks. In this paper, we consider a massive MIMO system where multiple users in a single-cell theater watch an identical VR 360 video. Based on tile prediction, the base station (BS) delivers the tiles in the predicted field of view (FoV) to users. By introducing practical supplementary transmission for missing tiles and unacceptable VR sickness, we propose the first stable transmission scheme for VR video. We formulate an integer non-linear programming (INLP) problem to maximize users' average quality of experience (QoE) score. Moreover, we derive the achievable spectral efficiency (SE) expression of predictive tile groups and the approximately achievable SE expression of missing tile groups, respectively. Analytically, the overall throughput is related to the number of tile groups and the length of pilot sequences. By exploiting the relationship between the structure of viewport tiles and the SE expression, we propose a multi-lattice multi-stream grouping method aimed at improving the overall throughput for VR video transmission. Moreover, we analyze the relationship between the QoE objective and the number of predictive tiles. We transform the original INLP problem into an integer linear programming problem by setting the predictive tile groups as constants. With variable relaxation and recovery, we obtain the optimal average QoE. Extensive simulation results validate that the proposed algorithm effectively improves QoE.

[61]  arXiv:2106.08177 (cross-list from cs.CR) [pdf]
Title: The Reliability and Acceptance of Biometric System in Bangladesh: Users Perspective
Comments: 7 pages, 4 figures, Published with International Journal of Computer Trends and Technology (IJCTT)
Journal-ref: International Journal of Computer Trends and Technology, 69(6), 15-21, June 2021
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Systems and Control (eess.SY)

Biometric systems are the latest technologies of unique identification. People all over the world prefer to use this unique identification technology for their authentication security. The goal of this research is to evaluate biometric systems based on system reliability and user satisfaction. As the technology fully depends on personal data, user satisfaction is a principal factor in the quality and reliability of biometric systems. To keep pace with the digital era, it is extremely important to assess users' concerns about data security, as the systems conduct authentication by analyzing users' personal data. The study shows that users are more satisfied with biometric systems than with other security systems. Besides, hardware failure is a big issue faced by biometric system users. Finally, a matrix is generated to compare the performance of popular biometric systems based on users' opinions. As system reliability and user satisfaction are the focus of this research, biometric service providers can use these findings to identify which aspects of their services need improvement. Also, this study can serve as a useful guide for Bangladeshi users, so that they can easily determine which biometric system to choose.

[62]  arXiv:2106.08218 (cross-list from physics.med-ph) [pdf]
Title: Accurate Dose Measurements Using Cherenkov Polarization Imaging
Subjects: Medical Physics (physics.med-ph); Image and Video Processing (eess.IV); Instrumentation and Detectors (physics.ins-det)

Purpose: Cherenkov radiation carries the potential of direct in-water dose measurements, but its precision is currently limited by a strong anisotropy. Taking advantage of polarization imaging, this work proposes a new approach for high accuracy Cherenkov dose measurements. Methods: Cherenkov produced in a 15x15x20 cm^3 water tank is imaged with a cooled CCD camera from four polarizer transmission axes [0°, 45°, 90°, 135°]. The water tank is positioned at the isocenter of a 5x5 cm^2, 6 MV photon beam. Using Malus' law, the polarized portion of the signal is extracted. Corrections are applied to the polarized signal following azimuthal and polar Cherenkov angular distributions extracted from Monte Carlo simulations. Percent depth dose and beam profiles are measured and compared with the prediction from a treatment planning system (TPS). Results: Corrected polarized signals on the central axis reduced deviations at depth from 20% to 0.8 ± 1%. For the profile measurement, differences between the corrected polarized signal and the TPS calculations are 1 ± 3% and 8 ± 3% on the central axis and penumbra regions respectively. 29 ± 1% of the Cherenkov signal was found to be polarized. Conclusions: This work proposes a novel polarization imaging approach enabling high precision water-based Cherenkov dose measurements. The method allows correction of the Cherenkov anisotropy within 3% on the beam central axis and in depth.
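
As a hedged illustration of how a polarized portion can be separated from intensities recorded at the four polarizer axes, the sketch below uses the standard linear Stokes-parameter identities that follow from Malus' law. The abstract does not give the authors' exact procedure, so the function names and the synthetic intensity values are assumptions chosen for demonstration.

```python
import math


def malus(unpolarized, polarized, phi, theta):
    """Intensity behind an ideal linear polarizer at angle theta (Malus' law).

    The unpolarized part is halved; the polarized part (angle phi) follows cos^2.
    """
    return unpolarized / 2.0 + polarized * math.cos(theta - phi) ** 2


def polarized_fraction(i0, i45, i90, i135):
    """Return (fraction of linearly polarized light, polarization angle in rad)."""
    s0 = (i0 + i45 + i90 + i135) / 2.0  # total intensity
    s1 = i0 - i90                       # polarized * cos(2 * phi)
    s2 = i45 - i135                     # polarized * sin(2 * phi)
    polarized = math.hypot(s1, s2)
    angle = 0.5 * math.atan2(s2, s1)
    return polarized / s0, angle


# Synthetic check: 71 units unpolarized, 29 units polarized at 0.3 rad (made up).
readings = [malus(71.0, 29.0, 0.3, math.radians(a)) for a in (0, 45, 90, 135)]
fraction, angle = polarized_fraction(*readings)
```

With these synthetic inputs the recovered fraction is 0.29 and the recovered angle is 0.3 rad, up to floating-point error.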

[63]  arXiv:2106.08233 (cross-list from cs.CV) [pdf, other]
Title: Spot the Difference: Topological Anomaly Detection via Geometric Alignment
Comments: Preprint, under review
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Geometric alignment appears in a variety of applications, ranging from domain adaptation, optimal transport, and normalizing flows in machine learning; optical flow and learned augmentation in computer vision and deformable registration within biomedical imaging. A recurring challenge is the alignment of domains whose topology is not the same; a problem that is routinely ignored, potentially introducing bias in downstream analysis. As a first step towards solving such alignment problems, we propose an unsupervised topological difference detection algorithm. The model is based on a conditional variational auto-encoder and detects topological anomalies with regards to a reference alongside the registration step. We consider both a) topological changes in the image under spatial variation and b) unexpected transformations. Our approach is validated on a proxy task of unsupervised anomaly detection in images.

[64]  arXiv:2106.08256 (cross-list from cond-mat.mtrl-sci) [pdf, other]
Title: Phase retrieval from 4-dimensional electron diffraction datasets
Comments: Accepted conference paper of IEEE ICIP 2021
Subjects: Materials Science (cond-mat.mtrl-sci); Image and Video Processing (eess.IV)

We present a computational imaging mode for large scale electron microscopy data, which retrieves a complex wave from noisy/sparse intensity recordings using a deep learning approach and subsequently reconstructs an image of the specimen from the Convolutional Neural Network (CNN) predicted exit waves. We demonstrate that an appropriate forward model in combination with open data frameworks can be used to generate large synthetic datasets for training. In combination with augmenting the data with Poisson noise corresponding to varying dose-values, we effectively eliminate overfitting issues. The U-NET based architecture of the CNN is adapted to the task at hand and performs well while maintaining a relatively small size and fast performance. The validity of the approach is confirmed by comparing the reconstruction to well-established methods using simulated, as well as real electron microscopy data. The proposed method is shown to be effective particularly in the low dose range, evident by strong suppression of noise, good spatial resolution, and sensitivity to different atom types, enabling the simultaneous visualisation of light and heavy elements and making different atomic species distinguishable. Since the method acts on a very local scale and is comparatively fast it bears the potential to be used for near-real-time reconstruction during data acquisition.

[65]  arXiv:2106.08285 (cross-list from cs.CV) [pdf, other]
Title: Multi-StyleGAN: Towards Image-Based Simulation of Time-Lapse Live-Cell Microscopy
Comments: accepted to MICCAI 2021. (Tim Prangemeier and Christoph Reich --- both authors contributed equally)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)

Time-lapse fluorescent microscopy (TLFM) combined with predictive mathematical modelling is a powerful tool to study the inherently dynamic processes of life on the single-cell level. Such experiments are costly, complex and labour intensive. A complementary approach, and a step towards completely in silico experiments, is to synthesise the imagery itself. Here, we propose Multi-StyleGAN as a descriptive approach to simulate time-lapse fluorescence microscopy imagery of living cells, based on a past experiment. This novel generative adversarial network synthesises a multi-domain sequence of consecutive timesteps. We showcase Multi-StyleGAN on imagery of multiple live yeast cells in microstructured environments and train on a dataset recorded in our laboratory. The simulation captures underlying biophysical factors and time dependencies, such as cell morphology, growth, physical interactions, as well as the intensity of a fluorescent reporter protein. An immediate application is to generate additional training and validation data for feature extraction algorithms or to aid and expedite development of advanced experimental techniques such as online monitoring or control of cells.
Code and dataset are available at https://git.rwth-aachen.de/bcs/projects/tp/multi-stylegan.

[66]  arXiv:2106.08318 (cross-list from cs.CV) [pdf, other]
Title: Gradient Forward-Propagation for Large-Scale Temporal Video Modelling
Comments: Accepted to CVPR 2021. arXiv admin note: text overlap with arXiv:2001.06232
Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

How can neural networks be trained on large-volume temporal data efficiently? To compute the gradients required to update parameters, backpropagation blocks computations until the forward and backward passes are completed. For temporal signals, this introduces high latency and hinders real-time learning. It also creates a coupling between consecutive layers, which limits model parallelism and increases memory consumption. In this paper, we build upon Sideways, which avoids blocking by propagating approximate gradients forward in time, and we propose mechanisms for temporal integration of information based on different variants of skip connections. We also show how to decouple computation and delegate individual neural modules to different devices, allowing distributed and parallel training. The proposed Skip-Sideways achieves low-latency training and model parallelism and, importantly, is capable of extracting temporal features, leading to more stable training and improved performance on real-world action recognition video datasets such as HMDB51, UCF101, and the large-scale Kinetics-600. Finally, we also show that models trained with Skip-Sideways generate better future frames than Sideways models, and hence they can better utilize motion cues.

Replacements for Wed, 16 Jun 21

[67]  arXiv:1909.00508 (replaced) [pdf, other]
Title: Two-Stage Electricity Markets with Renewable Energy Integration: Market Mechanisms and Equilibrium Analysis
Subjects: Computer Science and Game Theory (cs.GT); General Economics (econ.GN); Systems and Control (eess.SY)
[68]  arXiv:1912.08421 (replaced) [pdf, other]
Title: Learning to Prevent Leakage: Privacy-Preserving Inference in the Mobile Cloud
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
[69]  arXiv:2004.11468 (replaced) [pdf, other]
Title: How to find a unicorn: a novel model-free, unsupervised anomaly detection method for time series
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
[70]  arXiv:2005.08898 (replaced) [pdf, ps, other]
Title: Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent
Comments: Accepted to Journal of Machine Learning Research
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP); Optimization and Control (math.OC); Machine Learning (stat.ML)
[71]  arXiv:2008.07428 (replaced) [pdf, other]
Title: Fast decentralized non-convex finite-sum optimization with recursive variance reduction
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY); Machine Learning (stat.ML)
[72]  arXiv:2010.04597 (replaced) [pdf, other]
Title: Computing Dynamic User Equilibrium on Large-Scale Networks Without Knowing Global Parameters
Comments: This paper replaces and extends the previous work arXiv:1810.00777. The paper is accepted for publication at Networks and Spatial Economics
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
[73]  arXiv:2010.11066 (replaced) [pdf, other]
Title: Contextualized Attention-based Knowledge Transfer for Spoken Conversational Question Answering
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[74]  arXiv:2010.12341 (replaced) [pdf, other]
Title: Abstracting the Traffic of Nonlinear Event-Triggered Control Systems
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
[75]  arXiv:2011.00472 (replaced) [pdf, other]
Title: Optimal minimal-contact routing of randomly arriving agents through connected networks
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
[76]  arXiv:2011.09284 (replaced) [pdf, other]
Title: 3D imaging from multipath temporal echoes
Comments: Main document: 5 pages, 3 figures. Supplementary document: 8 pages, 7 figures. Supplementary videos can be accessed in the following link: this https URL
Journal-ref: Phys. Rev. Lett. 126, 174301 (2021)
Subjects: Image and Video Processing (eess.IV); Applied Physics (physics.app-ph)
[77]  arXiv:2011.10538 (replaced) [pdf, other]
Title: Improving RNN-T ASR Accuracy Using Context Audio
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[78]  arXiv:2011.12690 (replaced) [pdf, other]
Title: DeepKoCo: Efficient latent planning with a robust Koopman representation
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Systems and Control (eess.SY)
[79]  arXiv:2012.00952 (replaced) [pdf, other]
Title: Mechanism Design for Demand Management in Energy Communities
Subjects: Computer Science and Game Theory (cs.GT); Systems and Control (eess.SY)
[80]  arXiv:2012.04262 (replaced) [pdf, other]
Title: Overcomplete Representations Against Adversarial Videos
Comments: Accepted at IEEE International Conference on Image Processing (ICIP) 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[81]  arXiv:2012.08388 (replaced) [pdf, other]
Title: Dynamic driving and routing games for autonomous vehicles on networks: A mean field game approach
Comments: 32 pages, 13 figures
Journal-ref: Transportation Research Part C: Emerging Technologies, 128, 103189 (2021)
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
[82]  arXiv:2012.10018 (replaced) [pdf, ps, other]
Title: NeurST: Neural Speech Translation Toolkit
Comments: Accepted by ACL 2021 (system demonstration)
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[83]  arXiv:2101.09451 (replaced) [pdf, other]
Title: Error Diffusion Halftoning Against Adversarial Examples
Comments: Accepted at IEEE International Conference on Image Processing (ICIP) 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[84]  arXiv:2102.09914 (replaced) [pdf, other]
Title: Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input
Comments: 4 pages
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[85]  arXiv:2102.12525 (replaced) [pdf, other]
Title: Prior Image-Constrained Reconstruction using Style-Based Generative Models
Comments: Accepted for publication at the International Conference on Machine Learning (ICML) 2021
Subjects: Image and Video Processing (eess.IV)
[86]  arXiv:2103.05541 (replaced) [pdf, ps, other]
Title: Constrained Contextual Bandit Learning for Adaptive Radar Waveform Selection
Comments: 16 pages, 9 figures. arXiv admin note: text overlap with arXiv:2010.15698
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
[87]  arXiv:2103.16858 (replaced) [pdf, other]
Title: SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification
Comments: Submitted to Interspeech 2021
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[88]  arXiv:2104.00769 (replaced) [pdf, other]
Title: Keyword Transformer: A Self-Attention Model for Keyword Spotting
Comments: Proceedings of INTERSPEECH
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[89]  arXiv:2104.01271 (replaced) [pdf, other]
Title: PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification
Comments: Accepted to Interspeech 2021
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
[90]  arXiv:2104.01497 (replaced) [pdf, other]
Title: Hi-Fi Multi-Speaker English TTS Dataset
Subjects: Audio and Speech Processing (eess.AS)
[91]  arXiv:2104.02207 (replaced) [pdf, other]
Title: Dissecting User-Perceived Latency of On-Device E2E Speech Recognition
Comments: Proc. of Interspeech 2021
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[92]  arXiv:2104.02397 (replaced) [pdf, other]
Title: ProsoBeast Prosody Annotation Tool
Comments: Accepted at Interspeech 2021
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
[93]  arXiv:2104.02469 (replaced) [pdf, other]
Title: Speaker Diarization using Two-pass Leave-One-Out Gaussian PLDA Clustering of DNN Embeddings
Comments: 5 pages, 2 figures, accepted at INTERSPEECH 2021
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Signal Processing (eess.SP)
[94]  arXiv:2104.02518 (replaced) [pdf, other]
Title: An Initial Investigation for Detecting Partially Spoofed Audio
Comments: INTERSPEECH 2021
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[95]  arXiv:2104.06104 (replaced) [pdf, ps, other]
Title: Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept
Comments: Accepted at Interspeech 2021
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[96]  arXiv:2104.06346 (replaced) [pdf, other]
Title: A Distributed Mixed-Integer Framework to Stochastic Optimal Microgrid Control
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
[97]  arXiv:2104.10611 (replaced) [pdf, other]
Title: Programmable 3D snapshot microscopy with Fourier convolutional networks
Comments: Make zebrafish Types A,B,C,D more clear
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[98]  arXiv:2104.13970 (replaced) [pdf, other]
Title: Personalized Keyphrase Detection using Speaker and Environment Information
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[99]  arXiv:2105.05752 (replaced) [pdf, other]
Title: Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders
Comments: ACL 2021
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[100]  arXiv:2106.00167 (replaced) [pdf, other]
Title: Regularization by Adversarial Learning for Ultrasound Elasticity Imaging
Subjects: Image and Video Processing (eess.IV); Signal Processing (eess.SP)
[101]  arXiv:2106.00644 (replaced) [pdf, other]
Title: A normal form for grid forming power grid components
Subjects: Adaptation and Self-Organizing Systems (nlin.AO); Systems and Control (eess.SY)
[102]  arXiv:2106.02182 (replaced) [pdf, other]
Title: Self-supervised Dialogue Learning for Spoken Conversational Question Answering
Comments: To appear at Interspeech 2021
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[103]  arXiv:2106.04312 (replaced) [pdf, other]
Title: Speech BERT Embedding For Improving Prosody in Neural TTS
Journal-ref: ICASSP 2021
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[104]  arXiv:2106.04748 (replaced) [pdf, other]
Title: Online Optimization in Games via Control Theory: Connecting Regret, Passivity and Poincaré Recurrence
Comments: In ICML 2021
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Systems and Control (eess.SY); Dynamical Systems (math.DS)
[105]  arXiv:2106.05564 (replaced) [pdf, other]
Title: FRI-TEM: Time Encoding Sampling of Finite-Rate-of-Innovation Signals
Comments: 11 pages, 9 figures
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
[106]  arXiv:2106.06759 (replaced) [pdf, ps, other]
Title: AI Enlightens Wireless Communication: Analyses, Solutions and Opportunities on CSI Feedback
Subjects: Signal Processing (eess.SP)
[107]  arXiv:2106.06922 (replaced) [src]
Title: Cross-sentence Neural Language Models for Conversational Speech Recognition
Comments: The wordings and organizations of the draft still have room for improvement
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[108]  arXiv:2106.06971 (replaced) [pdf, other]
Title: NLHD: A Pixel-Level Non-Local Retinex Model for Low-Light Image Enhancement
Comments: 14 pages, 11 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[109]  arXiv:2106.07417 (replaced) [pdf, other]
Title: Online Estimation of Resource Overload Risk in 5G Multi-Tenancy Network
Comments: To appear at ESREL 2021
Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)