Electrical Engineering and Systems Science

New submissions

New submissions for Tue, 28 Jun 22

[1]  arXiv:2206.12407 [pdf]
Title: Independent evaluation of state-of-the-art deep networks for mammography
Comments: 17 pages, 8 figures, 4 tables
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)

Deep neural models have shown remarkable performance in image recognition tasks, whenever large datasets of labeled images are available. The largest datasets in radiology are available for screening mammography. Recent reports, including in high impact journals, document performance of deep models at or above that of trained radiologists. What is not yet known is whether performance of these trained models is robust and replicates across datasets. Here we evaluate performance of five published state-of-the-art models on four publicly available mammography datasets. The limited size of public datasets precludes retraining the models, so we are limited to evaluating those models that have been made available with pre-trained parameters. Where test data was available, we replicated published results. However, the trained models performed poorly on out-of-sample data, except when based on all four standard views of a mammographic exam. We conclude that future progress will depend on a concerted effort to make more diverse and larger mammography datasets publicly available. Meanwhile, results that are not accompanied by a release of trained models for independent validation should be judged cautiously.

[2]  arXiv:2206.12417 [pdf, other]
Title: Deep embedded clustering algorithm for clustering PACS repositories
Journal-ref: Proceedings of the 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Creating large datasets of medical radiology images from several sources can be challenging because of the differences in the acquisition and storage standards. One possible way of controlling and/or assessing the image selection process is through medical image clustering. This, however, requires an efficient method for learning latent image representations. In this paper, we tackle the problem of fully-unsupervised clustering of medical images using pixel data only. We test the performance of several contemporary approaches, built on top of a convolutional autoencoder (CAE) - convolutional deep embedded clustering (CDEC) and convolutional improved deep embedded clustering (CIDEC) - and three approaches based on preset feature extraction - histogram of oriented gradients (HOG), local binary pattern (LBP) and principal component analysis (PCA). CDEC and CIDEC are end-to-end clustering solutions, involving simultaneous learning of latent representations and clustering assignments, whereas the remaining approaches rely on k-means clustering from fixed embeddings. We train the models on 30,000 images, and test them using a separate test set consisting of 8,000 images. We sampled the data from the PACS repository archive of the Clinical Hospital Centre Rijeka. For evaluation, we use silhouette score, homogeneity score and normalised mutual information (NMI) on two target parameters, closely associated with commonly occurring DICOM tags - Modality and anatomical region (adjusted BodyPartExamined tag). CIDEC attains an NMI score of 0.473 with respect to anatomical region, and CDEC attains an NMI score of 0.645 with respect to the tag Modality - both outperforming other commonly used feature descriptors.
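
The normalised mutual information (NMI) score used above can be illustrated with a minimal sketch. It assumes the arithmetic-mean normalisation, which is one common choice; the abstract does not say which variant the authors used. The function name and the toy Modality labels are hypothetical.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_true, labels_pred):
    """NMI with arithmetic-mean normalisation: 2 * I(U;V) / (H(U) + H(V))."""
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label lists must be non-empty and of equal length")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        # Shannon entropy (natural log) of the empirical label distribution.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_true = entropy(count_true)
    h_pred = entropy(count_pred)

    # Mutual information from the joint and marginal label counts.
    mutual_info = sum(
        (c / n) * math.log((c * n) / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )

    if h_true + h_pred == 0.0:
        return 1.0  # both labelings consist of a single group
    return 2.0 * mutual_info / (h_true + h_pred)


# Hypothetical toy data: DICOM Modality tags versus cluster assignments.
modality = ["CT", "CT", "MR", "MR", "XA", "XA"]
clusters = [0, 0, 1, 1, 2, 2]
print(normalized_mutual_information(modality, clusters))
```

A score near 1 means the clusters line up with the tag values, and a score near 0 means the cluster assignment carries no information about the tag.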

[3]  arXiv:2206.12440 [pdf, other]
Title: Undetectable GPS-Spoofing Attack on Time Series Phasor Measurement Unit Data
Subjects: Systems and Control (eess.SY)

The Phasor Measurement Unit (PMU) is an important metering device for the smart grid. Like any other Intelligent Electronic Device (IED), PMUs are prone to various types of cyberattacks. However, one form of attack is unique to the PMU: the GPS-spoofing attack, where the time and/or the one-second pulse that enables time synchronization are modified and the measurements are computed using the modified time reference. This article exploits the vulnerability of PMUs in their GPS time synchronization signal. First, the paper proposes an undetectable attack scheme that is able to bypass Bad Data Detection (BDD) algorithms used with PMU data. The attack is applied by solving a convex optimization criterion at regular time intervals, so that after a specific time period the attack vector incurs a significant change in the angle information delivered by the PMU. Second, the impact of the phase angle shift on the power flow calculation between two adjacent nodes of a transmission line is analyzed with a numerical experiment using the IEEE 39-bus system. Moreover, the undetectability of the proposed attack scheme against conventional $\chi^2$, Weighted Least Squares (WLS) and Kalman Filtering (KF) tests on the estimation residuals is investigated. Finally, the power flow results with the proposed attack are compared with the results using a random GPS-spoofing attack. The proposed method gives the attacker more control over the impact on the power grid than a random attack does. Furthermore, the proposed attack model has demonstrated a very small probability of detection against each of the common detection methods. For WLS and KF, the attack is detected only about 1-20$\%$ of the time, whereas for the $\chi^2$-test, the attack goes undetected 100$\%$ of the time.
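
To see why a spoofed time reference matters for power flow, the following sketch converts a timing offset into a phase-angle error and feeds it into the textbook lossless-line power-flow formula. It is not the paper's attack scheme, which solves a convex optimization. The 60 Hz frequency, the per-unit values, the 100 microsecond offset and the function names are all assumptions chosen for illustration.

```python
import math


def phase_shift_deg(time_offset_s, grid_freq_hz=60.0):
    """Phase-angle error in degrees caused by an offset in the time reference."""
    return 360.0 * grid_freq_hz * time_offset_s


def line_power_pu(v_from, v_to, reactance, angle_diff_deg):
    """Active power over a lossless line: P = V1 * V2 / X * sin(delta)."""
    return v_from * v_to / reactance * math.sin(math.radians(angle_diff_deg))


# Hypothetical operating point, all quantities in per-unit.
true_angle_deg = 10.0      # assumed true angle difference between the two buses
spoof_offset_s = 100e-6    # assumed spoofed timing offset of 100 microseconds

shift = phase_shift_deg(spoof_offset_s)
p_true = line_power_pu(1.0, 1.0, 0.1, true_angle_deg)
p_spoofed = line_power_pu(1.0, 1.0, 0.1, true_angle_deg + shift)

print(f"angle error: {shift:.2f} deg")
print(f"apparent power flow: {p_true:.3f} pu -> {p_spoofed:.3f} pu")
```

Even a sub-millisecond offset moves the computed angle by a few degrees, which is enough to change the apparent flow on a stiff line noticeably.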

[4]  arXiv:2206.12450 [pdf, other]
Title: Fast and Optimal Adaptive Tracking Control: A Novel Meta-Reinforcement Learning via Conditional Generative Adversarial Net
Subjects: Systems and Control (eess.SY)

The control of nonlinear systems with unknown dynamics has been a significant field of research for many years. This paper presents a novel data-driven optimal adaptive control structure with less control effort and faster adaptation than standard adaptive control counterparts. The proposed control structure utilizes the system's recorded data to increase the speed of adaptation and performance dramatically. In this study, we employ a conditional generative adversarial net (CGAN) as a novel central pattern generator to reproduce the steady-state harmonic pattern of the control signals matching the system's uncertainties over a wide range. We can also use the CGAN architecture as a fault detector. The CGAN provides a low-dimensional latent space of uncertainties. It enables rapid and convenient adaptation when there are many parametric uncertainties, especially for large-scale systems. Then, we introduce a novel meta-reinforcement learning framework to adapt the latent space of CGAN to the system's uncertainties as an optimal direct adaptive controller without any system identifier. Another part of the control structure is a regulator that achieves semi-global asymptotic tracking using the Lyapunov stability analysis. Finally, via some simulations, we evaluate the capabilities of the proposed designs on two dynamical systems, a robot manipulator and a large-scale musculoskeletal structure, in the presence of disturbance and perturbation.

[5]  arXiv:2206.12476 [pdf, other]
Title: Adaptive Neural Network Stochastic-Filter-based Controller for Attitude Tracking with Disturbance Rejection
Comments: IEEE Transactions on Neural Networks and Learning Systems
Subjects: Systems and Control (eess.SY)

This paper proposes a real-time neural network (NN) stochastic filter-based controller on the Lie Group of the Special Orthogonal Group $SO(3)$ as a novel approach to the attitude tracking problem. The introduced solution consists of two parts: a filter and a controller. Firstly, an adaptive NN-based stochastic filter is proposed that estimates attitude components and dynamics using measurements supplied by onboard sensors directly. The filter design accounts for measurement uncertainties inherent to the attitude dynamics, namely unknown bias and noise corrupting angular velocity measurements. The closed loop signals of the proposed NN-based stochastic filter have been shown to be semi-globally uniformly ultimately bounded (SGUUB). Secondly, a novel control law on $SO(3)$ coupled with the proposed estimator is presented. The control law addresses unknown disturbances. In addition, the closed loop signals of the proposed filter-based controller have been shown to be SGUUB. The proposed approach offers robust tracking performance by supplying the required control signal given data extracted from low-cost inertial measurement units. While the filter-based controller is presented in continuous form, the discrete implementation is also presented. Additionally, the unit-quaternion form of the proposed approach is given. The effectiveness and robustness of the proposed filter-based controller is demonstrated using its discrete form and considering low sampling rate, high initialization error, high-level of measurement uncertainties, and unknown disturbances. Keywords: Neuro-adaptive, estimator, filter, observer, control system, trajectory tracking, Lyapunov stability, stochastic differential equations, nonlinear filter, attitude tracking control, observer-based controller.

[6]  arXiv:2206.12489 [pdf, other]
Title: Predicting within and across language phoneme recognition performance of self-supervised learning speech pre-trained models
Comments: Submitted to INTERSPEECH 2022
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

In this work, we analyzed and compared speech representations extracted from different frozen self-supervised learning (SSL) speech pre-trained models on their ability to capture articulatory features (AF) information and their subsequent prediction of phone recognition performance for within and across language scenarios. Specifically, we compared CPC, wav2vec 2.0, and HuBert. First, frame-level AF probing tasks were implemented. Subsequently, phone-level end-to-end ASR systems for phoneme recognition tasks were implemented, and the performance on the frame-level AF probing task and the phone accuracy were correlated. Compared to the conventional speech representation MFCC, all SSL pre-trained speech representations captured more AF information, and achieved better phoneme recognition performance within and across languages, with HuBert performing best. The frame-level AF probing task is a good predictor of phoneme recognition performance, showing the importance of capturing AF information in the speech representations. Compared with MFCC, in the within-language scenario, the performance of these SSL speech pre-trained models on AF probing tasks achieved a maximum relative increase of 34.4%, and it resulted in the lowest PER of 10.2%. In the cross-language scenario, the maximum relative increase of 26.7% also resulted in the lowest PER of 23.0%.
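
The two quantities reported above, phone error rate (PER) and relative increase, can be sketched as follows. PER is taken here as the Levenshtein distance between reference and hypothesis phone sequences divided by the reference length, which is the usual definition; the abstract does not spell out its scoring details. The function names and phone sequences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, deletions, insertions) between sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp):
    """PER = edit distance / number of reference phones."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_increase(baseline, value):
    """Relative change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline


# Hypothetical phone sequences with one substitution.
reference = ["k", "ae", "t"]
hypothesis = ["k", "ah", "t"]
print(phone_error_rate(reference, hypothesis))
```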

[7]  arXiv:2206.12512 [pdf, other]
Title: FetReg2021: A Challenge on Placental Vessel Segmentation and Registration in Fetoscopy
Comments: Submitted to MedIA (Medical Image Analysis)
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Fetoscopy laser photocoagulation is a widely adopted procedure for treating Twin-to-Twin Transfusion Syndrome (TTTS). The procedure involves photocoagulating pathological anastomoses to regulate blood exchange among twins. The procedure is particularly challenging due to the limited field of view, poor manoeuvrability of the fetoscope, poor visibility, and variability in illumination. These challenges may lead to increased surgery time and incomplete ablation. Computer-assisted intervention (CAI) can provide surgeons with decision support and context awareness by identifying key structures in the scene and expanding the fetoscopic field of view through video mosaicking. Research in this domain has been hampered by the lack of high-quality data to design, develop and test CAI algorithms. Through the Fetoscopic Placental Vessel Segmentation and Registration (FetReg2021) challenge, which was organized as part of the MICCAI2021 Endoscopic Vision challenge, we released the first large-scale multicentre TTTS dataset for the development of generalized and robust semantic segmentation and video mosaicking algorithms. For this challenge, we released a dataset of 2060 images, pixel-annotated for vessels, tool, fetus and background classes, from 18 in-vivo TTTS fetoscopy procedures and 18 short video clips. Seven teams participated in this challenge and their model performance was assessed on an unseen test dataset of 658 pixel-annotated images from 6 fetoscopic procedures and 6 short clips. The challenge provided an opportunity for creating generalized solutions for fetoscopic scene understanding and mosaicking. In this paper, we present the findings of the FetReg2021 challenge alongside reporting a detailed literature review for CAI in TTTS fetoscopy. Through this challenge, its analysis and the release of multi-centre fetoscopic data, we provide a benchmark for future research in this field.

[8]  arXiv:2206.12527 [pdf, other]
Title: Infinite Impulse Response Graph Neural Networks for Cyberattack Localization in Smart Grids
Comments: 5 pages, 5 figures
Subjects: Signal Processing (eess.SP); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Systems and Control (eess.SY)

This study employs Infinite Impulse Response (IIR) Graph Neural Networks (GNN) to efficiently model the inherent graph network structure of the smart grid data to address the cyberattack localization problem. First, we numerically analyze the empirical frequency response of the Finite Impulse Response (FIR) and IIR graph filters (GFs) to approximate an ideal spectral response. We show that, for the same filter order, IIR GFs provide a better approximation to the desired spectral response and they also present the same level of approximation to a lower order GF due to their rational type filter response. Second, we propose an IIR GNN model to efficiently predict the presence of cyberattacks at the bus level. Finally, we evaluate the model under various cyberattacks at both sample-wise (SW) and bus-wise (BW) level, and compare the results with the existing architectures. It is experimentally verified that the proposed model outperforms the state-of-the-art FIR GNN model by 9.2% and 14% in terms of SW and BW localization, respectively.
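
For orientation, the difference between the two filter families can be written in the standard textbook form below. The symbols (graph shift operator $\mathbf{S}$, graph frequency $\lambda$, coefficients $h_k$, $a_p$, $b_q$, orders $K$, $P$, $Q$) are assumptions; the abstract does not give the paper's exact parameterization.

```latex
% FIR graph filter of order K: a polynomial in the graph shift operator S,
% with a polynomial response over the graph frequencies \lambda.
H_{\mathrm{FIR}}(\mathbf{S}) = \sum_{k=0}^{K} h_k \mathbf{S}^k,
\qquad
h_{\mathrm{FIR}}(\lambda) = \sum_{k=0}^{K} h_k \lambda^k

% IIR graph filter: a rational response over the graph frequencies.
h_{\mathrm{IIR}}(\lambda) =
  \frac{\sum_{q=0}^{Q} b_q \lambda^q}{1 + \sum_{p=1}^{P} a_p \lambda^p}
```

Because a ratio of polynomials can follow sharp transitions in the desired response with fewer coefficients than a single polynomial, a rational filter can match a target response at a lower order, which is the property the abstract attributes to IIR graph filters.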

[9]  arXiv:2206.12545 [pdf]
Title: Impedance-based AC/DC Terminal Modeling and Analysis of MMC-BTB system
Subjects: Systems and Control (eess.SY)

Impedance-based small-signal stability analysis is widely applied in practical engineering with modular multilevel converters (MMCs). However, both the deficiencies of existing impedance models (IMs) and the idealized extension for the single MMC influence the analyses in multiterminal systems. Such gaps are filled by focusing on an MMC-based back-to-back system in this paper. To obtain the steady-state trajectory of the system, a numerical method is first proposed based on Newton-Raphson iteration in the frequency domain. Then, by substituting the shared terminal dynamics with active or passive devices, theoretical AC/DC IMs considering typical control loops are established based on the multiharmonic linearization directly, where the pure time delay can be included precisely. Further aided by the derived IMs, two neglected aspects in current literature, i.e., the influence of power transformer on low-frequency impedance characteristics and the rationality of using simplified IMs for the high-frequency resonance study, are investigated especially. It is confirmed that the stability of interlinking systems should be analyzed at both AC and DC terminals comprehensively. This helps position the instability source, obtain the stability margin, and guide the supplementary control strategy. All IMs and analyses are verified by frequency-scan and simulations in PSCAD.

[10]  arXiv:2206.12679 [pdf, ps, other]
Title: Optimal Regulation of Prosumers and Consumers in Smart Energy Communities
Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA); Optimization and Control (math.OC)

In a smart energy community, energy prosumers and consumers group together to achieve the community's social welfare. Prosumers are the users that both consume and produce energy. In this paper, we develop algorithms to regulate the number of prosumers and the number of consumers in the smart energy community. We consider that the prosumers have heterogeneous energy sources, such as solar photovoltaic panels and wind turbines. Each prosumer has one of the systems installed in their household. The prosumers and the consumers keep their information private and do not share it with other prosumers or consumers in the community. However, we consider a community manager that keeps track of the total number of active prosumers and consumers and sends feedback signals in the community at each time step. Over a long time, the average number of times a prosumer is active reaches its optimal value; analogously, the average number of times a consumer is active reaches its optimal value and the community achieves the social optimum value. We present the experimental results to check the efficacy of the algorithms.

[11]  arXiv:2206.12729 [pdf, other]
Title: Variable-Depth Simulation of Most Permissive Boolean Networks
Comments: CMSB 2022
Subjects: Systems and Control (eess.SY)

In systems biology, Boolean networks (BNs) aim at modeling the qualitative dynamics of quantitative biological systems. Contrary to their (a)synchronous interpretations, the Most Permissive (MP) interpretation guarantees capturing all the trajectories of any quantitative system compatible with the BN, without additional parameters. Notably, the MP mode has the ability to capture transitions related to the heterogeneity of time scales and concentration scales in the abstracted quantitative system and which are not captured by asynchronous modes. So far, the analysis of MPBNs has focused on Boolean dynamical properties, such as the existence of particular trajectories or attractors. This paper addresses the sampling of trajectories from MPBNs in order to quantify the propensities of attractors reachable from a given initial BN configuration. The computation of MP transitions from a configuration is performed by iteratively discovering possible state changes. The number of iterations is referred to as the permissive depth, where the first depth corresponds to the asynchronous transitions. This permissive depth reflects the potential concentration and time scales heterogeneity along the abstracted quantitative process. The simulation of MPBNs is illustrated on several models from the literature, on which the depth parametrization can help to assess the robustness of predictions on attractor propensities changes triggered by model perturbations.
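
As a point of reference for the "first depth" mentioned above, the sketch below enumerates plain asynchronous transitions of a Boolean network and samples random trajectories to estimate how often each fixed point is reached. It does not implement the Most Permissive semantics or deeper permissive depths, and it detects only fixed-point attractors. The two-node mutual-inhibition network, the function names and the uniform choice among successors are hypothetical.

```python
import random
from collections import Counter


def asynchronous_successors(state, update_functions):
    """Return the states reachable by updating exactly one component."""
    successors = []
    for name, rule in update_functions.items():
        target = rule(state)
        if target != state[name]:
            next_state = dict(state)
            next_state[name] = target
            successors.append(next_state)
    return successors


def sample_fixed_point(state, update_functions, rng, max_steps=1000):
    """Follow random asynchronous transitions until no component can change."""
    for _ in range(max_steps):
        successors = asynchronous_successors(state, update_functions)
        if not successors:
            return tuple(sorted(state.items()))
        state = rng.choice(successors)
    return None  # no fixed point reached within the step budget


# Hypothetical toy network: two components that inhibit each other.
rules = {
    "a": lambda s: not s["b"],
    "b": lambda s: not s["a"],
}
start = {"a": False, "b": False}

rng = random.Random(0)
counts = Counter(sample_fixed_point(start, rules, rng) for _ in range(1000))
for attractor, hits in counts.items():
    print(attractor, hits / 1000)
```

The relative frequencies printed at the end are a crude estimate of attractor propensities from the chosen initial configuration under the stated sampling assumption.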

[12]  arXiv:2206.12742 [pdf, other]
Title: A Planning-free Longitudinal Controller Design for Vehicles in Dynamic Traffic Environments
Authors: Wubing B. Qin
Comments: 11 pages, 8 figures, 1 table
Subjects: Systems and Control (eess.SY)

This paper investigates the longitudinal control problem in a dynamic traffic environment where driving scenarios change between free-driving scenarios and car-following scenarios. A comprehensive longitudinal controller, independent of planning algorithms, is proposed to ensure reasonable transient and steady-state responses during scenario changes. This design takes into account passenger comfort, safety concerns and disturbance rejection, and attempts to meet the automated vehicle industry's requirements of lower cost, faster response, increased comfort, enhanced safety and elevated extendability. Design insights and intuitions are provided in detail. Comprehensive simulations are conducted to demonstrate the efficacy of the proposed controller in different driving scenarios.

[13]  arXiv:2206.12761 [pdf, other]
Title: Singularity-Avoidance Prescribed Performance Attitude Tracking of Spacecraft
Comments: 18 pages, 18 figures
Subjects: Systems and Control (eess.SY)

The attitude tracking problem with preassigned performance requirements has earned tremendous interest in recent years, and the Prescribed Performance Control (PPC) scheme is often adopted to tackle this problem. Nevertheless, traditional PPC schemes have inherent problems that remain unsolved, such as the singularity problem when the state constraint is violated and the potential over-control problem when the state trajectory approaches the constraint boundary.
This paper proposes a Singularity-Avoidance Prescribed Performance Control scheme (SAPPC) to deal with these problems. A novel shear mapping-based error transformation is proposed to provide a globally non-singular error transformation procedure, while a time-varying constraint boundary is employed to exert appropriate constraint strength at different control stages, alleviating the potential instability caused by the over-control problem. Besides, a novel piece-wise reference performance function (RPF) is constructed to provide a relevant reference trajectory for the state responding signals, allowing precise control of the system's responding behavior. Based on the proposed SAPPC scheme, a backstepping controller is developed, with the predefined-time stability technique and the dynamic surface control technique employed to enhance the controller's robustness and performance. Finally, theoretical analysis and numerical simulation results are presented to validate the proposed control scheme's effectiveness and robustness.

[14]  arXiv:2206.12774 [pdf, other]
Title: Meta Auxiliary Learning for Low-resource Spoken Language Understanding
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Spoken language understanding (SLU) treats automatic speech recognition (ASR) and natural language understanding (NLU) as a unified task and usually suffers from data scarcity. We exploit an ASR and NLU joint training method based on meta auxiliary learning to improve the performance of low-resource SLU tasks by taking advantage only of abundant manual transcriptions of speech data. One obvious advantage of such a method is that it provides a flexible framework to implement a low-resource SLU training task without requiring access to any further semantic annotations. In particular, an NLU model is taken as the label generation network to predict intent and slot tags from texts; a multi-task network trains the ASR task and the SLU task synchronously from speech; and the predictions of the label generation network are delivered to the multi-task network as semantic targets. The efficiency of the proposed algorithm is demonstrated with experiments on the public CATSLU dataset, where it produces more suitable ASR hypotheses for the downstream NLU task.

[15]  arXiv:2206.12799 [pdf]
Title: An Efficient Optimal Energy Flow Model for Integrated Energy Systems Based on Energy Circuit Modeling in the Frequency Domain
Subjects: Systems and Control (eess.SY)

With more energy networks being interconnected to form integrated energy systems (IESs), the optimal energy flow (OEF) problem has drawn increasing attention. Extant studies on OEF models mostly utilize the finite difference method (FDM) to deal with partial-differential-equation (PDE) constraints related to the dynamics in natural gas networks (NGNs) and district heating networks (DHNs). However, this time-domain approach suffers from a heavy computational burden with regard to achieving high finite-difference accuracy. In this paper, a novel efficient OEF model is studied. First, by extending the circuit modeling of electric power networks to NGNs and DHNs, an energy circuit method (ECM) that algebraizes the PDE models of NGNs and DHNs in the frequency domain is introduced. Then, an ECM-based OEF model is formulated, which contains fewer variables and constraints compared with an FDM-based OEF model and thereby yields better solving efficiency. Finally, variable space projection is employed to remove implicit variables, by which another constraint generation algorithm is enabled to remove redundant constraints. These two techniques further compact the OEF model and bring about a second improvement in solving efficiency. Numerical tests on large-scale IESs indicate that the final OEF model reduces variables and constraints by more than 95% and improves the solving efficiency by more than 10 times. The related codes will be released upon acceptance.

[16]  arXiv:2206.12800 [pdf]
Title: Unified Energy Circuit-based Integrated Energy Management System: Theory, Implementation, and Application
Subjects: Systems and Control (eess.SY)

Due to their advantages in efficiency and flexibility, integrated energy systems (IESs) have drawn increasing attention in recent years. To exploit the potential of these systems, an integrated energy management system (IEMS) is required to perform online analysis and optimization on coupling energy flows including electricity, natural gas, and heat. However, the complicated and long-term dynamic processes in natural gas networks and heating networks constitute a major obstacle to the implementation of IEMSs. In this article, a novel unified energy circuit (UEC) method that models natural gas networks and heating networks in the frequency domain with lumped parameters, inspired by the electric-circuit modeling of electricity networks, is proposed. Compared with conventional time-domain modeling methods, this method yields fewer variables and equations under the same accuracy and thereby produces better computational performance. Based on the UEC models, the design and development of the IEMS with advanced applications of dynamic state estimation, energy flow analysis, security assessment and control, optimal energy flow, etc. are presented, followed by numerical tests for validation. Finally, real-world engineering demonstrations of this IEMS on IESs at different scales are reported.

[17]  arXiv:2206.12809 [pdf]
Title: A Comparison of AIS, X-Band Marine Radar Systems and Camera Surveillance Systems in the Collection of Tracking Data
Journal-ref: International Journal of Recent Research and Applied Studies, Volume 7, Issue 4 (1) April 2020
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)

Maritime traffic has increased in recent years, especially in terms of seaborne trade. To ensure safety, security, and protection of the marine environment, several systems have been deployed. To overcome some of their shortcomings, the collected data is typically fused. The fused data is used for various purposes; the one of interest here is target tracking. The most relevant systems in that context are AIS and X-band marine radar. Many works consider that visual data provided by camera surveillance systems offers additional advantages. Therefore, many tracking algorithms using visual data (images) have been developed. Yet, there is little emphasis on the reasons that make the integration of camera systems important. Thus, our main aim in this paper is to analyze the aforementioned surveillance systems for target tracking and to identify some of the maritime security improvements resulting from the integration of cameras into the overall maritime surveillance system.

[18]  arXiv:2206.12815 [pdf, other]
Title: Breast Cancer Classification using Deep Learned Features Boosted with Handcrafted Features
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Breast cancer is one of the leading causes of death among women across the globe. It is difficult to treat if detected at advanced stages; however, early detection can significantly increase the chances of survival and improve the lives of millions of women. Given the widespread prevalence of breast cancer, it is of utmost importance for the research community to come up with a framework for early detection, classification and diagnosis. The artificial intelligence research community, in coordination with medical practitioners, is developing such frameworks to automate the task of detection. With the surge in research activities, coupled with the availability of large datasets and enhanced computational power, it is expected that AI frameworks will help even more clinicians in making correct predictions. In this article, a novel framework for the classification of breast cancer using mammograms is proposed. The proposed framework combines robust features extracted from a novel Convolutional Neural Network (CNN) with handcrafted features including HOG (Histogram of Oriented Gradients) and LBP (Local Binary Pattern). The obtained results on the CBIS-DDSM dataset exceed the state of the art.

[19]  arXiv:2206.12836 [pdf, other]
Title: Joint Location and Beamforming Design for STAR-RIS Assisted NOMA Systems
Subjects: Signal Processing (eess.SP)

Simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) assisted non-orthogonal multiple access (NOMA) communication systems are investigated, where a STAR-RIS is deployed within a predefined region for establishing communication links for users in its vicinity. Both beamformer-based NOMA and cluster-based NOMA schemes are employed at the multi-antenna base station (BS). For each scheme, the STAR-RIS deployment location, the passive transmitting and reflecting beamforming (BF) of the STAR-RIS, and the active BF at the BS are jointly optimized for maximizing the weighted sum-rate (WSR) of users. To solve the resultant non-convex problems, an alternating optimization (AO) algorithm is proposed, where successive convex approximation (SCA) and semi-definite programming (SDP) methods are invoked for iteratively addressing the non-convexity of each sub-problem. Numerical results reveal that 1) the WSR performance can be significantly enhanced by optimizing the specific deployment location of the STAR-RIS; 2) both beamformer-based and cluster-based NOMA prefer asymmetric STAR-RIS deployment.

[20]  arXiv:2206.12857 [pdf, other]
Title: Transport-Oriented Feature Aggregation for Speaker Embedding Learning
Comments: Accepted for presentation at INTERSPEECH 2022
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

Pooling is needed to aggregate frame-level features into utterance-level representations for speaker modeling. Given the success of statistics-based pooling methods, we hypothesize that speaker characteristics are well represented in the statistical distribution over the pre-aggregation layer's output, and propose to use transport-oriented feature aggregation for deriving speaker embeddings. The aggregated representation encodes the geometric structure of the underlying feature distribution, which is expected to contain valuable speaker-specific information that may not be represented by the commonly used statistical measures like mean and variance. The original transport-oriented feature aggregation is also extended to a weighted-frame version to incorporate the attention mechanism. Experiments on speaker verification with the Voxceleb dataset show improvement over statistics pooling and its attentive variant.
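
The statistics-pooling baseline that the abstract compares against can be sketched in a few lines: frame-level feature vectors are reduced to the concatenation of their per-dimension mean and standard deviation. This shows only the baseline, not the transport-oriented aggregation the paper proposes. The function name and the toy frames are hypothetical, and the population (divide-by-N) standard deviation is an assumption.

```python
import math


def statistics_pooling(frames):
    """Concatenate per-dimension mean and standard deviation over all frames."""
    if not frames:
        raise ValueError("at least one frame is required")
    n = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in frames) / n)
        for d in range(dim)
    ]
    return means + stds


# Hypothetical utterance with two frames of two-dimensional features.
frames = [[1.0, 2.0], [3.0, 2.0]]
print(statistics_pooling(frames))
```

Whatever the number of frames, the output has twice the feature dimension, which is what lets a fixed-size utterance-level embedding be computed from variable-length speech.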

[21]  arXiv:2206.12868 [pdf, other]
Title: Polarimetric phase retrieval: uniqueness and algorithms
Comments: 37 pages, 10 figures
Subjects: Signal Processing (eess.SP)

This work introduces a novel Fourier phase retrieval model, called polarimetric phase retrieval, that enables a systematic use of polarization information in Fourier phase retrieval problems. We provide a complete characterization of uniqueness properties of this new model by unraveling equivalencies with a peculiar polynomial factorization problem. We introduce two different but complementary categories of reconstruction methods. The first one is algebraic and relies on the use of approximate greatest common divisor computations using Sylvester matrices. The second one carefully adapts existing algorithms for Fourier phase retrieval, namely semidefinite positive relaxation and Wirtinger-Flow, to solve the polarimetric phase retrieval problem. Finally, a set of numerical experiments permits a detailed assessment of the numerical behavior and relative performance of each proposed reconstruction strategy. We further highlight a reconstruction strategy that combines both approaches for scalable, computationally efficient and asymptotically MSE-optimal performance.

[22]  arXiv:2206.12894 [pdf, ps, other]
Title: Meta-material Sensor Based Internet of Things: Design, Optimization, and Implementation
Comments: 40 pages, 13 figures
Subjects: Signal Processing (eess.SP)

For many applications envisioned for the Internet of Things (IoT), it is expected that the sensors will have very low costs and zero power, which can be satisfied by meta-material sensor based IoT, i.e., meta-IoT. As their constituent meta-materials can reflect wireless signals with environment-sensitive reflection coefficients, meta-IoT sensors can achieve simultaneous sensing and transmission without any active modulation. However, to maximize the sensing accuracy, the structures of meta-IoT sensors need to be optimized considering their joint influence on sensing and transmission, which is challenging due to the high computational complexity in evaluating the influence, especially given a large number of sensors. In this paper, we propose a joint sensing and transmission design method for meta-IoT systems with a large number of meta-IoT sensors, which can efficiently optimize the sensing accuracy of the system. Specifically, a computationally efficient received signal model is established to evaluate the joint influence of meta-material structure on sensing and transmission. Then, a sensing algorithm based on deep unsupervised learning is designed to obtain accurate sensing results in a robust manner. Experiments with a prototype verify that the system has a higher sensitivity and a longer transmission range compared to existing designs, and can sense environmental anomalies correctly within 2 meters.

[23]  arXiv:2206.12908 [pdf, other]
Title: CNN-aided Channel and Carrier Frequency Offset Estimation for HAPS-LEO Links
Comments: 6 pages, 7 figures
Subjects: Signal Processing (eess.SP)

Low Earth orbit (LEO) satellite mega-constellation networks aim to address the high connectivity demands with a projected 50,000 satellites in less than a decade. To fully utilize such a large-scale dynamic network, an air network composed of stratospheric nodes, specifically high altitude platform stations (HAPS), can help significantly with a number of aspects including mobility management. The HAPS-LEO network will be subject to time-varying conditions, and in this paper, we introduce an artificial intelligence (AI)-based approach for the unique channel estimation and synchronization problems. First, channel equalization and carrier frequency offset with residual Doppler effects are minimized by using the proposed convolutional neural network-based estimator. Then, the data rate is compounded by increasing spectral efficiency using a non-orthogonal multiple access method. We observed that the proposed AI-empowered HAPS-LEO network provides not only a high data throughput but also higher service quality thanks to the agile signal reconstruction process.

[24]  arXiv:2206.12967 [pdf, other]
Title: RF Signal Classification with Synthetic Training Data and its Real-World Performance
Authors: Stefan Scholl
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)

Neural nets are a powerful method for the classification of radio signals in the electromagnetic spectrum. These neural nets are often trained with synthetically generated data due to the lack of diverse and plentiful real RF data. However, it is often unclear how neural nets trained on synthetic data perform in real-world applications. This paper investigates the impact of different RF signal impairments (such as phase, frequency and sample rate offsets, receiver filters, noise and channel models) modeled in synthetic training data with respect to the real-world performance. For that purpose, this paper trains neural nets with various synthetic training datasets with different signal impairments. After training, the neural nets are evaluated against real-world RF data collected by a software defined radio receiver in the field. This approach reveals which modeled signal impairments should be included in carefully designed synthetic datasets. The investigated showcase example can classify RF signals into one of 20 different radio signal types from the shortwave bands. It achieves an accuracy of up to 95 % in real-world operation by using carefully designed synthetic training data only.

[25]  arXiv:2206.12980 [pdf]
Title: Detecting Schizophrenia with 3D Structural Brain MRI Using Deep Learning
Comments: 13 pages, 6 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Schizophrenia is a chronic neuropsychiatric disorder that causes distinct structural alterations within the brain. We hypothesize that deep learning applied to a structural neuroimaging dataset could detect disease-related alteration and improve classification and diagnostic accuracy. We tested this hypothesis using a single, widely available, and conventional T1-weighted MRI scan, from which we extracted the 3D whole-brain structure using standard post-processing methods. A deep learning model was then developed, optimized, and evaluated on three open datasets with T1-weighted MRI scans of patients with schizophrenia. Our proposed model outperformed the benchmark model, which was also trained with structural MR images using a 3D CNN architecture. Our model is capable of almost perfectly (area under the ROC curve = 0.987) distinguishing schizophrenia patients from healthy controls on unseen structural MRI scans. Regional analysis localized subcortical regions and ventricles as the most predictive brain regions. Subcortical structures serve a pivotal role in cognitive, affective, and social functions in humans, and structural abnormalities of these regions have been associated with schizophrenia. Our finding corroborates that schizophrenia is associated with widespread alterations in subcortical brain structure and the subcortical structural information provides prominent features in diagnostic classification. Together, these results further demonstrate the potential of deep learning to improve schizophrenia diagnosis and identify its structural neuroimaging signatures from a single, standard T1-weighted brain MRI.

[26]  arXiv:2206.13007 [pdf, other]
Title: Mobility State Detection of Cellular-Connected UAVs based on Handover Count Statistics
Comments: Submitted to IEEE TVT. arXiv admin note: text overlap with arXiv:2002.06657
Subjects: Signal Processing (eess.SP)

To ensure reliable and effective mobility management for aerial user equipment (UE), estimating the speed of cellular-connected unmanned aerial vehicles (UAVs) carries critical importance since this can help to improve the quality of service of the cellular network. The 3GPP LTE standard uses the number of handovers made by a UE during a predefined time period to estimate the speed and the mobility state efficiently. In this paper, we introduce an approximation to the probability mass function of handover count (HOC) as a function of a cellular-connected UAV's height and velocity, HOC measurement time window, and different ground base station (GBS) densities. Afterward, we derive the Cramer-Rao lower bound (CRLB) for the speed estimate of a UAV, and also provide a simple biased estimator for the UAV's speed which depends on the GBS density and HOC measurement period. Interestingly, for a low time-to-trigger (TTT) parameter, the biased estimator turns into a minimum variance unbiased estimator (MVUE). By exploiting this speed estimator, we study the problem of detecting the mobility state of a UAV as low, medium, or high mobility as per the LTE specifications. Using CRLBs and our proposed MVUE, we characterize the accuracy improvement in speed estimation and mobility state detection as the GBS density and the HOC measurement window increase. Our analysis also shows that the accuracy of the proposed estimator does not vary significantly with respect to the TTT parameter.
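The handover-count rule mentioned above can be sketched as a simple windowed count compared against two thresholds. This is only an illustration of the general counting idea, not the estimator or detector derived in the paper. The function name `mobility_state`, the threshold values `n_medium` and `n_high`, the window length, and the sample timestamps are all hypothetical.

```python
def mobility_state(handover_times, now, window_s, n_medium, n_high):
    """Classify mobility as 'low', 'medium', or 'high' from a handover count.

    handover_times : timestamps (seconds) at which handovers occurred
    now            : current time (seconds)
    window_s       : length of the HOC measurement window (seconds)
    n_medium       : count above which the state is at least 'medium' (assumed value)
    n_high         : count above which the state is 'high' (assumed value)
    """
    # Count only handovers that fall inside the most recent measurement window
    count = sum(1 for t in handover_times if now - window_s < t <= now)
    if count > n_high:
        return "high"
    if count > n_medium:
        return "medium"
    return "low"

# Hypothetical trace: six handovers within the last 60 seconds
trace = [1.0, 5.0, 20.0, 40.0, 55.0, 58.0]
state = mobility_state(trace, now=60.0, window_s=60.0, n_medium=4, n_high=8)
```

A longer window or a denser base station deployment yields more handovers per measurement, which is consistent with the abstract's observation that accuracy improves as the GBS density and the HOC measurement window increase.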

[27]  arXiv:2206.13014 [pdf, other]
Title: Joint Optimization of Sampling Rate Offsets Based on Entire Signal Relationship Among Distributed Microphones
Comments: 5 pages, 2 figures, accepted by Interspeech 2022
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)

In this paper, we propose to simultaneously estimate all the sampling rate offsets (SROs) of multiple devices. In a distributed microphone array, the SRO is inevitable and deteriorates the performance of array signal processing. Most existing SRO estimation methods focus on synchronizing two microphones. When synchronizing more than two microphones, the conventional approach is to select one reference microphone and estimate the SRO of each non-reference microphone independently. Hence, the relationship among signals observed by non-reference microphones is not considered. To address this problem, the proposed method jointly optimizes all SROs based on a probabilistic model of a multichannel signal. The SROs and model parameters are alternately updated to increase the log-likelihood based on an auxiliary function. The effectiveness of the proposed method is validated on mixtures of various numbers of speakers.

[28]  arXiv:2206.13016 [pdf, other]
Title: Unsupervised Instance Discriminative Learning for Depression Detection from Speech Signals
Subjects: Audio and Speech Processing (eess.AS); Quantitative Methods (q-bio.QM)

Major Depressive Disorder (MDD) is a severe illness that affects millions of people, and it is critical to diagnose this disorder as early as possible. Detecting depression from voice signals can be of great help to physicians and can be done without any invasive procedure. Since relevant labelled data are scarce, we propose a modified Instance Discriminative Learning (IDL) method, an unsupervised pre-training technique, to extract augment-invariant and instance-spread-out embeddings. In terms of learning augment-invariant embeddings, various data augmentation methods for speech are investigated, and time-masking yields the best performance. To learn instance-spread-out embeddings, we explore methods for sampling instances for a training batch (distinct speaker-based and random sampling). It is found that the distinct speaker-based sampling provides better performance than the random one, and we hypothesize that this result is because relevant speaker information is preserved in the embedding. Additionally, we propose a novel sampling strategy, Pseudo Instance-based Sampling (PIS), based on clustering algorithms, to enhance spread-out characteristics of the embeddings. Experiments are conducted with DepAudioNet on DAIC-WOZ (English) and CONVERGE (Mandarin) datasets, and statistically significant improvements, with p-value 0.0015 and 0.05, respectively, are observed using PIS in the detection of MDD relative to the baseline without pre-training.

[29]  arXiv:2206.13017 [pdf, ps, other]
Title: Safe Schedule Verification for Urban Air Mobility Networks with Node Closures
Comments: 12 pages
Subjects: Systems and Control (eess.SY)

In Urban Air Mobility (UAM) networks, takeoff and landing sites, called vertiports, are likely to experience intermittent closures due to, e.g., adverse weather. To ensure safety, all in-flight Urban Air Vehicles (UAVs) in a UAM network must therefore have alternative landing sites with sufficient landing capacity in the event of a vertiport closure. In this paper, we study the problem of safety verification of UAM schedules in the face of vertiport closures. We first provide necessary and sufficient conditions for a given UAM schedule to be safe in the sense that, if a vertiport closure occurs, then all UAVs will be able to safely land at a backup landing site. Next, we convert these conditions to an efficient algorithm for verifying safety of a UAM schedule via a linear program by using properties of totally unimodular matrices. Our algorithm allows for uncertain travel time between UAM vertiports and scales quadratically with the number of scheduled UAVs. We demonstrate our algorithm on a UAM network with up to 1,000 UAVs.

[30]  arXiv:2206.13044 [pdf, other]
Title: Extended U-Net for Speaker Verification in Noisy Environments
Comments: 5 pages, 2 figures, 4 tables, accepted to 2022 Interspeech as a conference paper
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

Background noise is a well-known factor that deteriorates the accuracy and reliability of speaker verification (SV) systems by blurring speech intelligibility. Various studies have used separate pretrained enhancement models as the front-end module of the SV system in noisy environments, and these methods effectively remove noises. However, the denoising process of independent enhancement models not tailored to the SV task can also distort the speaker information included in utterances. We argue that the enhancement network and speaker embedding extractor should be fully jointly trained for SV tasks under noisy conditions to alleviate this issue. Therefore, we proposed a U-Net-based integrated framework that simultaneously optimizes speaker identification and feature enhancement losses. Moreover, we analyzed the structural limitations of using U-Net directly for noise SV tasks and further proposed Extended U-Net to reduce these drawbacks. We evaluated the models on the noise-synthesized VoxCeleb1 test set and VOiCES development set recorded in various noisy scenarios. The experimental results demonstrate that the U-Net-based fully joint training framework is more effective than the baseline, and the extended U-Net exhibited state-of-the-art performance versus the recently proposed compensation systems.

[31]  arXiv:2206.13058 [pdf, other]
Title: Attitude estimation from vector measurements: Necessary and sufficient conditions and convergent observer design
Subjects: Systems and Control (eess.SY)

The paper addresses the problem of attitude estimation for rigid bodies using (possibly time-varying) vector measurements, for which we provide a necessary and sufficient condition of distinguishability. Such a condition is shown to be strictly weaker than those previously used for attitude observer design. Thereafter, we show that even for the single vector case the resulting condition is sufficient to design almost globally convergent attitude observers, and two explicit designs are obtained. To overcome the weak excitation issue, the first design makes full use of historical information, whereas the second scheme dynamically generates a virtual reference vector, which remains non-collinear to the given vector measurement. Simulation results illustrate the accurate estimation despite noisy measurements.

[32]  arXiv:2206.13066 [pdf, other]
Title: Detection of Doctored Speech: Towards an End-to-End Parametric Learn-able Filter Approach
Authors: Rohit Arora
Comments: arXiv admin note: text overlap with arXiv:1904.05441 by other authors
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

Automatic Speaker Verification (ASV) systems have potential in biometrics applications for logical access control and authentication. Much is at stake if the ASV system is compromised. The preliminary work presents a comparative analysis of the wavelet and MFCC-based state-of-the-art spoof detection techniques developed in (Novoselov et al., 2016) and (Alam et al., 2016a), respectively. The results on ASVspoof 2015 justify our inclination towards wavelet-based features instead of MFCC features. The experiments on the ASVspoof 2019 database show the lack of credibility of the traditional handcrafted features and give us more reason to progress towards using end-to-end deep neural networks and more recent techniques. We use the Sincnet architecture as our baseline. We obtain end-to-end (E2E) deep learning models, which we call WSTnet and CWTnet, by replacing the Sinc layer with the Wavelet Scattering and Continuous Wavelet Transform layers, respectively. The fusion model achieved 62% and 17% relative improvement over traditional handcrafted models and our Sincnet baseline, respectively, when evaluated on the modern spoofing attacks in ASVspoof 2019.
The final scale distribution and the number of scales used in CWTnet are far from optimal for the task at hand. To address this, we replaced the CWT layer with a Wavelet Deconvolution (WD) (Khan and Yener, 2018) layer in our CWTnet architecture. This layer calculates the Discrete-Continuous Wavelet Transform similar to the CWTnet but also optimizes the scale parameter using back-propagation. The WDnet model achieved 26% and 7% relative improvement over the CWTnet and Sincnet models, respectively, when evaluated on the ASVspoof 2019 dataset. This shows that more generalized features are extracted than with CWTnet, as only the most important and relevant frequency regions are focused upon.

[33]  arXiv:2206.13084 [pdf, other]
Title: State and Input Constrained Model Reference Adaptive Control
Comments: 6 pages, 3 figures
Subjects: Systems and Control (eess.SY)

Satisfaction of state and input constraints is one of the most critical requirements in control engineering applications. In the classical model reference adaptive control (MRAC) formulation, although the states and the input remain bounded, the bound is neither user-defined nor known a priori. In this paper, an MRAC is developed for a multivariable linear time-invariant (LTI) plant with user-defined state and input constraints using a simple saturated control design coupled with a barrier Lyapunov function (BLF). Without any restrictive assumptions that may limit practical implementation, the proposed controller guarantees that both the plant state and the control input remain within a user-defined safe set for all time while simultaneously ensuring that the plant state trajectory tracks the reference model trajectory. The controller ensures that all the closed-loop signals remain bounded and the trajectory tracking error converges to zero asymptotically. Simulation results validate the efficacy of the proposed constrained MRAC in terms of better tracking performance and limited control effort compared to the standard MRAC algorithm.

[34]  arXiv:2206.13123 [pdf, other]
Title: Unsupervised Domain Adaptation Using Feature Disentanglement And GCNs For Medical Image Classification
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

The success of deep learning has set new benchmarks for many medical image analysis tasks. However, deep models often fail to generalize in the presence of distribution shifts between training (source) data and test (target) data. One method commonly employed to counter distribution shifts is domain adaptation: using samples from the target domain to learn to account for shifted distributions. In this work we propose an unsupervised domain adaptation approach that uses graph neural networks and disentangled semantic and domain-invariant structural features, allowing for better performance across distribution shifts. We propose an extension to swapped autoencoders to obtain more discriminative features. We test the proposed method for classification on two challenging medical image datasets with distribution shifts: multi-center chest X-ray images and histopathology images. Experiments show our method achieves state-of-the-art results compared to other domain adaptation methods.

[35]  arXiv:2206.13127 [pdf, other]
Title: Intelligent Omni-Surfaces (IOSs) for the MIMO Broadcast Channel
Comments: Accepted to be published in the 23rd IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC 2022)
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)

In this paper, we consider intelligent omni-surfaces (IOSs), which are capable of simultaneously reflecting and refracting electromagnetic waves. We focus our attention on the multiple-input multiple-output (MIMO) broadcast channel, and we introduce an algorithm for jointly optimizing the covariance matrix at the base station, the matrix of reflection and transmission coefficients at the IOS, and the amount of power that is reflected and refracted from the IOS. The distinguishing feature of this work lies in taking into account that the reflection and transmission coefficients of an IOS are tightly coupled. Simulation results show the convergence of the proposed algorithm and the benefits of using surfaces with simultaneous reflection and refraction capabilities.

[36]  arXiv:2206.13129 [pdf, other]
Title: Mitigating Load-Altering Attacks Against Power Grids Using Cyber-Resilient Economic Dispatch
Subjects: Systems and Control (eess.SY)

Large-scale Load-Altering Attacks (LAAs) against Internet-of-Things (IoT) enabled high-wattage electrical appliances (e.g., wifi-enabled air-conditioners, electric vehicles, etc.) pose a serious threat to power systems' security and stability. In this work, a Cyber-Resilient Economic Dispatch (CRED) framework is presented to mitigate the destabilizing effect of LAAs while minimizing the overall operational cost by dynamically optimizing the frequency droop control gains of Inverter-Based Resources (IBRs). The system frequency dynamics incorporating both LAAs and the IBR droop control are modeled. The system stability constraints are explicitly derived based on parametric sensitivities. To incorporate them into the CRED model and minimize the error of the sensitivity analysis, a recursive linearization method is further proposed. A distributionally robust approach is applied to account for the uncertainty associated with the LAA detection/parameter estimation. The overall performance of the proposed CRED model is demonstrated through simulations in a modified IEEE reliability test system.

[37]  arXiv:2206.13173 [pdf, ps, other]
Title: Context-Aware Transformers For Spinal Cancer Detection and Radiological Grading
Comments: Pre-print of paper accepted to MICCAI 2022. 15 pages, 7 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

This paper proposes a novel transformer-based model architecture for medical imaging problems involving analysis of vertebrae. It considers two applications of such models in MR images: (a) detection of spinal metastases and the related conditions of vertebral fractures and metastatic cord compression, (b) radiological grading of common degenerative changes in intervertebral discs. Our contributions are as follows: (i) We propose a Spinal Context Transformer (SCT), a deep-learning architecture suited for the analysis of repeated anatomical structures in medical imaging such as vertebral bodies (VBs). Unlike previous related methods, SCT considers all VBs as viewed in all available image modalities together, making predictions for each based on context from the rest of the spinal column and all available imaging modalities. (ii) We apply the architecture to a novel and important task: detecting spinal metastases and the related conditions of cord compression and vertebral fractures/collapse from multi-series spinal MR scans. This is done using annotations extracted from free-text radiological reports as opposed to bespoke annotation. However, the resulting model shows strong agreement with vertebral-level bespoke radiologist annotations on the test set. (iii) We also apply SCT to an existing problem: radiological grading of inter-vertebral discs (IVDs) in lumbar MR scans for common degenerative changes. We show that by considering the context of vertebral bodies in the image, SCT improves the accuracy for several gradings compared to previously published models.

[38]  arXiv:2206.13203 [pdf, ps, other]
Title: MIMO Symbiotic Radio with Massive Passive Devices: Asymptotic Analysis and Precoding Optimization
Comments: 12 pages, 7 figures, submitted to IEEE Transactions on Communications. arXiv admin note: text overlap with arXiv:2106.05789
Subjects: Signal Processing (eess.SP)

Symbiotic radio has emerged as a promising technology for spectrum- and energy-efficient wireless communications, where the passive secondary backscatter devices (BDs) reuse not only the spectrum but also the power of the active primary users to transmit their own information. In return, the primary communication links can be enhanced by the additional multipaths created by the BDs. This is known as the mutualism relationship of symbiotic radio. However, due to the severe double-fading attenuation of the passive backscattering links, the enhancement of the primary link provided by one single BD is extremely limited. To address this issue and enable full mutualism of symbiotic radio, in this paper, we study multiple-input multiple-output (MIMO) symbiotic radio communication systems with massive BDs. We first derive the achievable rates of the primary active communication and secondary passive communication, and then consider the asymptotic regime as the number of BDs grows large, for which closed-form expressions are derived to reveal the relationship between the primary and secondary communication rates. Furthermore, the precoding optimization problem is studied to maximize the primary communication rate while guaranteeing that the secondary communication rate is no smaller than a certain threshold. Simulation results are provided to validate our theoretical studies.

[39]  arXiv:2206.13231 [pdf, other]
Title: QbyE-MLPMixer: Query-by-Example Open-Vocabulary Keyword Spotting using MLPMixer
Comments: Accepted to INTERSPEECH 2022
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)

Current keyword spotting systems are typically trained with a large number of pre-defined keywords. Recognizing keywords in an open-vocabulary setting is essential for personalizing smart device interaction. Towards this goal, we propose a pure MLP-based neural network that is based on MLPMixer, an MLP model architecture that effectively replaces the attention mechanism in Vision Transformers. We investigate different ways of adapting the MLPMixer architecture to the QbyE open-vocabulary keyword spotting task. Comparisons with the state-of-the-art RNN and CNN models show that our method achieves better performance in challenging situations (10 dB and 6 dB environments) on both the publicly available Hey-Snips dataset and a larger scale internal dataset with 400 speakers. Our proposed model also has a smaller number of parameters and MACs compared to the baseline models.

[40]  arXiv:2206.13232 [pdf, other]
Title: Conformer Based Elderly Speech Recognition System for Alzheimer's Disease Detection
Comments: 5 pages, 1 figure, accepted by INTERSPEECH 2022
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)

Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care to delay further progression. This paper presents the development of a state-of-the-art Conformer based speech recognition system built on the DementiaBank Pitt corpus for automatic AD detection. The baseline Conformer system trained with speed perturbation and SpecAugment based data augmentation is significantly improved by incorporating a set of purposefully designed modeling features, including neural architecture search based auto-configuration of domain-specific Conformer hyper-parameters in addition to parameter fine-tuning; fine-grained elderly speaker adaptation using learning hidden unit contributions (LHUC); and two-pass cross-system rescoring based combination with hybrid TDNN systems. An overall word error rate (WER) reduction of 13.6% absolute (34.8% relative) was obtained on the evaluation data of 48 elderly speakers. Using the final systems' recognition outputs to extract textual features, the best-published speech recognition based AD detection accuracy of 91.7% was obtained.
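The absolute and relative WER reductions quoted above jointly imply approximate baseline and final error rates, which the abstract does not state directly. The short calculation below recovers them under the assumption that both figures refer to the same baseline on the same evaluation data. The resulting values are inferred from rounded numbers and are therefore approximate. The variable names are chosen for this example.

```python
# Figures quoted in the abstract
absolute_reduction = 13.6   # WER points (percent absolute)
relative_reduction = 0.348  # fraction of the baseline WER

# relative = absolute / baseline  =>  baseline = absolute / relative
implied_baseline_wer = absolute_reduction / relative_reduction
implied_final_wer = implied_baseline_wer - absolute_reduction

print(f"implied baseline WER: {implied_baseline_wer:.1f}%")
print(f"implied final WER:    {implied_final_wer:.1f}%")
```

Under these assumptions the baseline sits near 39% WER and the final system near 25.5% WER.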

[41]  arXiv:2206.13235 [pdf, other]
Title: Bayesian Neural Network Detector for an Orthogonal Time Frequency Space Modulation
Comments: Submitted to IEEE Wireless Communication Letter
Subjects: Signal Processing (eess.SP)

The orthogonal time-frequency space (OTFS) modulation is proposed for beyond 5G wireless systems to deal with high mobility communications. The existing low complexity OTFS detectors exhibit poor performance in rich scattering environments where there are a large number of moving reflectors that reflect the transmitted signal towards the receiver. In this paper, we propose an OTFS detector, referred to as the BPICNet OTFS detector, that integrates neural network (NN), Bayesian inference, and parallel interference cancellation concepts. Simulation results show that the proposed OTFS detector significantly outperforms the state-of-the-art.

[42]  arXiv:2206.13236 [pdf, other]
Title: Pruned RNN-T for fast, memory-efficient ASR training
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The RNN-Transducer (RNN-T) framework for speech recognition has been growing in popularity, particularly for deployed real-time ASR systems, because it combines high accuracy with naturally streaming recognition. One of the drawbacks of RNN-T is that its loss function is relatively slow to compute, and can use a lot of memory. Excessive GPU memory usage can make it impractical to use RNN-T loss in cases where the vocabulary size is large: for example, for Chinese character-based ASR. We introduce a method for faster and more memory-efficient RNN-T loss computation. We first obtain pruning bounds for the RNN-T recursion using a simple joiner network that is linear in the encoder and decoder embeddings; we can evaluate this without using much memory. We then use those pruning bounds to evaluate the full, non-linear joiner network.

[43]  arXiv:2206.13240 [pdf, other]
Title: A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data
Comments: Accepted at ECNLP @ACL 2022
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)

Automatic Speech Recognition (ASR) has been dominated by deep learning-based end-to-end speech recognition models. These approaches require large amounts of labeled data in the form of audio-text pairs. Moreover, these models are more susceptible to domain shift as compared to traditional models. It is common practice to train generic ASR models and then adapt them to target domains using comparatively smaller data sets. We consider a more extreme case of domain adaptation where only a text-only corpus is available. In this work, we propose a simple baseline technique for domain adaptation in end-to-end speech recognition models. We convert the text-only corpus to audio data using a single-speaker Text to Speech (TTS) engine. The parallel data in the target domain is then used to fine-tune the final dense layer of generic ASR models. We show that single-speaker synthetic TTS data coupled with fine-tuning of only the final dense layer provides reasonable improvements in word error rates. We use text data from address and e-commerce search domains to show the effectiveness of our low-cost baseline approach on CTC and attention-based models.

[44]  arXiv:2206.13245 [pdf, other]
Title: Performance Evaluation of Dynamic Metasurface Antennas: Impact of Insertion Losses and Coupling
Subjects: Signal Processing (eess.SP)

This paper evaluates the performance of multi-user massive multiple-input multiple-output (MIMO) systems in which the base station is equipped with a dynamic metasurface antenna (DMA). Due to the physical implementation of DMAs, conventional models widely used in MIMO are no longer valid, and electromagnetic phenomena such as mutual coupling, insertion losses and reflections inside the waveguides need to be considered. Hence, starting from a recently proposed electromagnetic model for DMAs, we formulate a zero-forcing optimization problem, yielding an unconstrained objective function with known gradient. The performance is compared with that of full-digital and hybrid massive MIMO, focusing on the impact of insertion losses and mutual coupling.

[45]  arXiv:2206.13272 [pdf, other]
Title: Wideband Audio Waveform Evaluation Networks: Efficient, Accurate Estimation of Speech Qualities
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)

Wideband Audio Waveform Evaluation Networks (WAWEnets) are convolutional neural networks that operate directly on wideband audio waveforms in order to produce evaluations of those waveforms. In the present work these evaluations give qualities of telecommunications speech (e.g., noisiness, intelligibility, overall speech quality). WAWEnets are no-reference networks because they do not require ``reference'' (original or undistorted) versions of the waveforms they evaluate. Our initial WAWEnet publication introduced four WAWEnets and each emulated the output of an established full-reference speech quality or intelligibility estimation algorithm.
We have updated the WAWEnet architecture to be more efficient and effective. Here we present a single WAWEnet that closely tracks seven different quality and intelligibility values. We create a second network that additionally tracks four subjective speech quality dimensions. We offer a third network that focuses on just subjective quality scores and achieves very high levels of agreement. This work has leveraged 334 hours of speech in 13 languages, over two million full-reference target values and over 93,000 subjective mean opinion scores.
We also interpret the operation of WAWEnets and identify the key to their operation using the language of signal processing: ReLUs strategically move spectral information from non-DC components into the DC component. The DC values of 96 output signals define a vector in a 96-D latent space and this vector is then mapped to a quality or intelligibility value for the input waveform.
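
The claim that ReLUs move spectral information from non-DC components into the DC component can be checked on a toy signal. The sketch below is not taken from the WAWEnet paper; it uses a single zero-mean sinusoid as an assumed stand-in for a filtered waveform. Before the ReLU the signal has no DC component; after it, the mean rises to 1/pi, the known DC value of a half-wave rectified unit sine.

```python
import math


def relu(x: float) -> float:
    """Rectified linear unit: pass positive values, zero out the rest."""
    return x if x > 0.0 else 0.0


def dc_component(signal: list[float]) -> float:
    """The DC component of a sampled signal is its mean value."""
    return sum(signal) / len(signal)


# Toy input (an assumption for illustration): exactly one period of a unit sine.
N = 1000
sine = [math.sin(2.0 * math.pi * k / N) for k in range(N)]
rectified = [relu(s) for s in sine]

dc_before = dc_component(sine)      # close to zero: all energy is non-DC
dc_after = dc_component(rectified)  # close to 1/pi: the ReLU created a DC term

print(f"DC before ReLU: {dc_before:.6f}")
print(f"DC after ReLU:  {dc_after:.6f}")
print(f"1/pi:           {1.0 / math.pi:.6f}")
```

The DC value after the ReLU scales with the sinusoid's amplitude, which gives an intuition for how the DC values of the output signals can summarize the strength of non-DC content earlier in the network.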

[46]  arXiv:2206.13279 [pdf, other]
Title: Differential invariants for SE(2)-equivariant networks
Journal-ref: 29th IEEE International Conference on Image Processing (IEEE ICIP), Oct 2022, Bordeaux, France
Subjects: Image and Video Processing (eess.IV)

Symmetry is present in many tasks in computer vision, where the same class of objects can appear transformed, e.g. rotated due to different camera orientations, or scaled due to perspective. The knowledge of such symmetries in data coupled with equivariance of neural networks can improve their generalization to new samples. Differential invariants are equivariant operators computed from the partial derivatives of a function. In this paper we use differential invariants to define equivariant operators that form the layers of an equivariant neural network. Specifically, we derive invariants of the Special Euclidean Group SE(2), composed of rotations and translations, and apply them to construct an SE(2)-equivariant network, called SE(2) Differential Invariants Network (SE2DINNet). The network is subsequently tested in classification tasks which require a degree of equivariance or invariance to rotations. The results compare favorably with the state-of-the-art, even though the proposed SE2DINNet has far fewer parameters than the compared models.

[47]  arXiv:2206.13295 [pdf, other]
Title: Diffusion Deformable Model for 4D Temporal Medical Image Generation
Comments: Accepted for MICCAI 2022
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Temporal volume images with 3D+t (4D) information are often used in medical imaging to statistically analyze temporal dynamics or capture disease progression. Although deep-learning-based generative models for natural images have been extensively studied, approaches for temporal medical image generation such as 4D cardiac volume data are limited. In this work, we present a novel deep learning model that generates intermediate temporal volumes between source and target volumes. Specifically, we propose a diffusion deformable model (DDM) by adapting the denoising diffusion probabilistic model that has recently been widely investigated for realistic image generation. Our proposed DDM is composed of the diffusion and the deformation modules so that DDM can learn spatial deformation information between the source and target volumes and provide a latent code for generating intermediate frames along a geodesic path. Once our model is trained, the latent code estimated from the diffusion module is simply interpolated and fed into the deformation module, which enables DDM to generate temporal frames along the continuous trajectory while preserving the topology of the source image. We demonstrate the proposed method with the 4D cardiac MR image generation between the diastolic and systolic phases for each subject. Compared to the existing deformation methods, our DDM achieves high performance on temporal volume generation.

[48]  arXiv:2206.13310 [pdf, other]
Title: Insights into Deep Non-linear Filters for Improved Multi-channel Speech Enhancement
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)

The key advantage of using multiple microphones for speech enhancement is that spatial filtering can be used to complement the tempo-spectral processing. In a traditional setting, linear spatial filtering (beamforming) and single-channel post-filtering are commonly performed separately. In contrast, there is a trend towards employing deep neural networks (DNNs) to learn a joint spatial and tempo-spectral non-linear filter, which means that the restriction of a linear processing model and that of a separate processing of spatial and tempo-spectral information can potentially be overcome. However, the internal mechanisms that lead to good performance of such data-driven filters for multi-channel speech enhancement are not well understood. Therefore, in this work, we analyse the properties of a non-linear spatial filter realized by a DNN as well as its interdependency with temporal and spectral processing by carefully controlling the information sources (spatial, spectral, and temporal) available to the network. We confirm the superiority of a non-linear spatial processing model, which outperforms an oracle linear spatial filter in a challenging speaker extraction scenario for a low number of microphones by 0.24 POLQA score. Our analyses reveal that spectral information in particular should be processed jointly with spatial information, as this increases the spatial selectivity of the filter. Our systematic evaluation then leads to a simple network architecture that outperforms state-of-the-art network architectures on a speaker extraction task by 0.22 POLQA score and by 0.32 POLQA score on the CHiME3 data.

[49]  arXiv:2206.13319 [pdf, other]
Title: Safe, Learning-Based MPC for Highway Driving under Lane-Change Uncertainty: A Distributionally Robust Approach
Comments: Under review
Subjects: Systems and Control (eess.SY)

We present a case study applying learning-based distributionally robust model predictive control to highway motion planning under stochastic uncertainty of the lane change behavior of surrounding road users. The dynamics of road users are modelled using Markov jump systems, in which the switching variable describes the desired lane of the vehicle under consideration. We assume the switching probabilities of the underlying Markov chain to be unknown. As the vehicle is observed, and thus samples from the Markov chain are drawn, the transition probabilities are estimated along with an ambiguity set which accounts for misestimations of these probabilities. Correspondingly, a distributionally robust optimal control problem is formulated over a scenario tree and solved in receding horizon. As a result, a motion planning procedure is obtained which, through observation of the target vehicle, gradually becomes less conservative while avoiding overconfidence in estimates obtained from small sample sizes. We present an extensive numerical case study, comparing the effects of several different design aspects on the controller performance and safety.

[50]  arXiv:2206.13365 [pdf, other]
Title: Interpretable Acoustic Representation Learning on Breathing and Speech Signals for COVID-19 Detection
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)

In this paper, we describe an approach for representation learning of audio signals for the task of COVID-19 detection. The raw audio samples are processed with a bank of 1-D convolutional filters that are parameterized as cosine modulated Gaussian functions. The choice of these kernels allows the interpretation of the filterbanks as smooth band-pass filters. The filtered outputs are pooled, log-compressed and used in a self-attention based relevance weighting mechanism. The relevance weighting emphasizes the key regions of the time-frequency decomposition that are important for the downstream task. The subsequent layers of the model consist of a recurrent architecture and the models are trained for a COVID-19 detection task. In our experiments on the Coswara data set, we show that the proposed model achieves significant performance improvements over the baseline system as well as other representation learning approaches. Further, the approach proposed is shown to be uniformly applicable for speech and breathing signals and for transfer learning from a larger data set.
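
A cosine-modulated Gaussian kernel of the kind mentioned above can be sampled in a few lines. The sketch assumes the common form g(t) = cos(2*pi*f*t) * exp(-t^2 / (2*sigma^2)); the paper's exact parameterization, and the names `center_freq_hz` and `sigma_s`, are assumptions made here. Because the Fourier transform of a Gaussian is a Gaussian and cosine modulation shifts it to the center frequency, such a kernel acts as a smooth band-pass filter.

```python
import math


def cos_gauss_kernel(center_freq_hz: float, sigma_s: float,
                     sample_rate_hz: float, length: int) -> list[float]:
    """Sample g(t) = cos(2*pi*f*t) * exp(-t^2 / (2*sigma^2)) on a grid
    centered at t = 0. `length` must be odd so the kernel is symmetric."""
    if length % 2 == 0:
        raise ValueError("length must be odd")
    half = length // 2
    kernel = []
    for n in range(-half, half + 1):
        t = n / sample_rate_hz
        envelope = math.exp(-(t * t) / (2.0 * sigma_s * sigma_s))
        kernel.append(math.cos(2.0 * math.pi * center_freq_hz * t) * envelope)
    return kernel


# Illustrative values only: a 1 kHz filter at 16 kHz sampling, 1 ms width.
kernel = cos_gauss_kernel(center_freq_hz=1000.0, sigma_s=1e-3,
                          sample_rate_hz=16000.0, length=101)
print(len(kernel), kernel[50])
```

In a learnable filterbank, the center frequency and width of each kernel would be the trainable parameters, which is what makes the learned filters directly interpretable.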

[51]  arXiv:2206.13374 [pdf, other]
Title: Stability Verification of Neural Network Controllers using Mixed-Integer Programming
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)

We propose a framework for the stability verification of Mixed-Integer Linear Programming (MILP) representable control policies. This framework compares a fixed candidate policy, which admits an efficient parameterization and can be evaluated at a low computational cost, against a fixed baseline policy, which is known to be stable but expensive to evaluate. We provide sufficient conditions for the closed-loop stability of the candidate policy in terms of the worst-case approximation error with respect to the baseline policy, and we show that these conditions can be checked by solving a Mixed-Integer Quadratic Program (MIQP). Additionally, we demonstrate that an outer approximation of the stability region of the candidate policy can be computed by solving an MILP. The proposed framework is sufficiently general to accommodate a broad range of candidate policies including ReLU Neural Networks (NNs), optimal solution maps of parametric quadratic programs, and Model Predictive Control (MPC) policies. We also present an open-source toolbox in Python based on the proposed framework, which allows for the easy verification of custom NN architectures and MPC formulations. We showcase the flexibility and reliability of our framework in the context of a DC-DC power converter case study and investigate the computational complexity.

[52]  arXiv:2206.13385 [pdf, other]
Title: 3D unsupervised anomaly detection and localization through virtual multi-view projection and reconstruction: Clinical validation on low-dose chest computed tomography
Comments: Kyung-Su Kim and Seong Je Oh have contributed equally to this work as the co-first author. Kyung-Su Kim (kskim.doc@gmail.com) and Myung Jin Chung (mj1.chung@samsung.com) have contributed equally to this work as the co-corresponding author
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Computer-aided diagnosis for low-dose computed tomography (CT) based on deep learning has recently attracted attention as a first-line automatic testing tool because of its high accuracy and low radiation exposure. However, existing methods rely on supervised learning, imposing an additional burden on doctors for collecting disease data or annotating spatial labels for network training, consequently hindering their implementation. We propose a method based on a deep neural network for computer-aided diagnosis called virtual multi-view projection and reconstruction for unsupervised anomaly detection. Presumably, this is the first method that only requires data from healthy patients for training to identify three-dimensional (3D) regions containing any anomalies. The method has three key components. Unlike existing computer-aided diagnosis tools that use conventional CT slices as the network input, our method 1) improves the recognition of 3D lung structures by virtually projecting an extracted 3D lung region to obtain two-dimensional (2D) images from diverse views to serve as network inputs, 2) accommodates the input diversity gain for accurate anomaly detection, and 3) achieves 3D anomaly/disease localization through a novel 3D map restoration method using multiple 2D anomaly maps. The proposed method based on unsupervised learning improves the patient-level anomaly detection by 10% (area under the curve, 0.959) compared with a gold standard based on supervised learning (area under the curve, 0.848), and it localizes the anomaly region with 93% accuracy, demonstrating its high performance.

[53]  arXiv:2206.13393 [pdf, other]
Title: Cross-Modal Transformer GAN: A Brain Structure-Function Deep Fusing Framework for Alzheimer's Disease
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Cross-modal fusion of different types of neuroimaging data has shown great promise for predicting the progression of Alzheimer's Disease (AD). However, most existing methods applied in neuroimaging cannot efficiently fuse the functional and structural information from multi-modal neuroimages. In this work, a novel cross-modal transformer generative adversarial network (CT-GAN) is proposed to fuse functional information contained in resting-state functional magnetic resonance imaging (rs-fMRI) and structural information contained in Diffusion Tensor Imaging (DTI). The developed bi-attention mechanism can match functional information to structural information efficiently and maximize the capability of extracting complementary information from rs-fMRI and DTI. By capturing the deep complementary information between structural features and functional features, the proposed CT-GAN can detect AD-related brain connectivity, which could be used as a biomarker of AD. Experimental results show that the proposed model can not only improve classification performance but also detect AD-related brain connectivity effectively.

[54]  arXiv:2206.13394 [pdf, other]
Title: CS$^2$: A Controllable and Simultaneous Synthesizer of Images and Annotations with Minimal Human Intervention
Comments: 11 figures, Accepted by MICCAI 2022
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

The scarcity of image data and corresponding expert annotations limits the training capacities of AI diagnostic models and potentially inhibits their performance. To address this problem of data and label scarcity, generative models have been developed to augment the training datasets. Previously proposed generative models usually require manually adjusted annotations (e.g., segmentation masks) or need pre-labeling. However, studies have found that these pre-labeling based methods can induce hallucinating artifacts, which might mislead the downstream clinical tasks, while manual adjustment could be onerous and subjective. To avoid manual adjustment and pre-labeling, we propose a novel controllable and simultaneous synthesizer (dubbed CS$^2$) in this study to generate both realistic images and corresponding annotations at the same time. Our CS$^2$ model is trained and validated using high resolution CT (HRCT) data collected from COVID-19 patients to realize efficient infection segmentation with minimal human intervention. Our contributions include 1) a conditional image synthesis network that receives both style information from reference CT images and structural information from unsupervised segmentation masks, and 2) a corresponding segmentation mask synthesis network to automatically segment these synthesized images simultaneously. Our experimental studies on HRCT scans collected from COVID-19 patients demonstrate that our CS$^2$ model can lead to realistic synthesized datasets and promising segmentation results of COVID infections compared to the state-of-the-art nnUNet trained and fine-tuned in a fully supervised manner.

[55]  arXiv:2206.13404 [pdf, other]
Title: Avocodo: Generative Adversarial Network for Artifact-free Vocoder
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)

Neural vocoders based on the generative adversarial neural network (GAN) have been widely used due to their fast inference speed and lightweight networks while generating high-quality speech waveforms. Since the perceptually important speech components are primarily concentrated in the low-frequency band, most of the GAN-based neural vocoders perform multi-scale analysis that evaluates downsampled speech waveforms. This multi-scale analysis helps the generator improve speech intelligibility. However, in preliminary experiments, we observed that the multi-scale analysis which focuses on the low-frequency band causes unintended artifacts, e.g., aliasing and imaging artifacts, and these artifacts degrade the synthesized speech waveform quality. Therefore, in this paper, we investigate the relationship between these artifacts and GAN-based neural vocoders and propose a GAN-based neural vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts. We introduce two kinds of discriminators to evaluate waveforms from various perspectives: a collaborative multi-band discriminator and a sub-band discriminator. We also utilize a pseudo quadrature mirror filter bank to obtain downsampled multi-band waveforms while avoiding aliasing. The experimental results show that Avocodo outperforms conventional GAN-based neural vocoders in both speech and singing voice synthesis tasks and can synthesize artifact-free speech. Notably, Avocodo can even reproduce high-quality waveforms of unseen speakers.
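
The aliasing that arises when a waveform is downsampled without band-limiting can be reproduced with a single tone. The sketch below is an illustration only and does not reproduce Avocodo's discriminators or its pseudo quadrature mirror filter bank; the sampling rates and frequencies are assumed values. A 3 kHz tone sampled at 8 kHz is decimated by 2 with no anti-aliasing filter. The new Nyquist frequency is 2 kHz, so the tone folds down and becomes indistinguishable from a 1 kHz tone.

```python
import math


def tone(freq_hz: float, sample_rate_hz: float, num_samples: int) -> list[float]:
    """Sample a unit-amplitude cosine at the given frequency."""
    return [math.cos(2.0 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(num_samples)]


fs = 8000                        # original sampling rate (Hz)
x = tone(3000, fs, 64)           # 3 kHz tone, below the 4 kHz Nyquist limit

# Naive downsampling: keep every second sample, no low-pass filter first.
decimated = x[::2]               # effective rate 4 kHz, Nyquist 2 kHz

# A genuine 1 kHz tone sampled directly at 4 kHz.
alias = tone(1000, fs // 2, len(decimated))

max_diff = max(abs(a - b) for a, b in zip(decimated, alias))
print(f"max difference between decimated 3 kHz and true 1 kHz: {max_diff:.2e}")
```

The two sequences agree to within floating-point error, which is why a discriminator fed naively downsampled audio cannot tell genuine low-frequency content from folded high-frequency content.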

[56]  arXiv:2206.13411 [pdf, other]
Title: Audio Similarity is Unreliable as a Proxy for Audio Quality
Comments: To Appear, Interspeech 2022
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

Many audio processing tasks require perceptual assessment. However, the time and expense of obtaining ``gold standard'' human judgments limit the availability of such data. Most applications incorporate full reference or other similarity-based metrics (e.g. PESQ) that depend on a clean reference. Researchers have relied on such metrics to evaluate and compare various proposed methods, often concluding that small, measured differences imply one is more effective than another. This paper demonstrates several practical scenarios where similarity metrics fail to agree with human perception, because they: (1) vary with clean references; (2) rely on attributes that humans factor out when considering quality, and (3) are sensitive to imperceptible signal level differences. In those scenarios, we show that no-reference metrics do not suffer from such shortcomings and correlate better with human perception. We conclude therefore that similarity serves as an unreliable proxy for audio quality.

[57]  arXiv:2206.13419 [pdf, other]
Title: DeStripe: A Self2Self Spatio-Spectral Graph Neural Network with Unfolded Hessian for Stripe Artifact Removal in Light-sheet Microscopy
Comments: Accepted by 25th International Conference on Medical Image Computing and Computer Assisted Intervention
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Light-sheet fluorescence microscopy (LSFM) is a cutting-edge volumetric imaging technique that allows for three-dimensional imaging of mesoscopic samples with decoupled illumination and detection paths. Although the selective excitation scheme of such a microscope provides intrinsic optical sectioning that minimizes out-of-focus fluorescence background and sample photodamage, it is prone to light absorption and scattering effects, which result in uneven illumination and striping artifacts in the images. To tackle this issue, in this paper, we propose a blind stripe artifact removal algorithm in LSFM, called DeStripe, which combines a self-supervised spatio-spectral graph neural network with an unfolded Hessian prior. Specifically, inspired by the desirable properties of the Fourier transform in condensing striping information into isolated values in the frequency domain, DeStripe first localizes the potentially corrupted Fourier coefficients by exploiting the structural difference between unidirectional stripe artifacts and more isotropic foreground images. Affected Fourier coefficients can then be fed into a graph neural network for recovery, with a Hessian regularization unrolled to further ensure structures in the standard image space are well preserved. Since stripe-free LSFM images barely exist in practice under a standard image acquisition protocol, DeStripe is equipped with a Self2Self denoising loss term, enabling artifact elimination without access to stripe-free ground truth images. Competitive experimental results demonstrate the efficacy of DeStripe in recovering corrupted biomarkers in LSFM with both synthetic and real stripe artifacts.

[58]  arXiv:2206.13420 [pdf, other]
Title: Unsupervised Voice Activity Detection by Modeling Source and System Information using Zero Frequency Filtering
Comments: Accepted at Interspeech 2022
Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)

Voice activity detection (VAD) is an important pre-processing step for speech technology applications. The task consists of deriving segment boundaries of audio signals which contain voicing information. In recent years, it has been shown that voice source and vocal tract system information can be extracted using zero-frequency filtering (ZFF) without making any explicit model assumptions about the speech signal. This paper investigates the potential of zero-frequency filtering for jointly modeling voice source and vocal tract system information, and proposes two approaches for VAD. The first approach demarcates voiced regions using a composite signal composed of different zero-frequency filtered signals. The second approach feeds the composite signal as input to the rVAD algorithm. These approaches are compared with other supervised and unsupervised VAD methods in the literature, and are evaluated on the Aurora-2 database, across a range of SNRs (20 to -5 dB). Our studies show that the proposed ZFF-based methods perform comparably to state-of-the-art VAD methods and are more invariant to added degradation and different channel characteristics.

[59]  arXiv:2206.13437 [pdf]
Title: A Generalized Probabilistic Monitoring Model with Both Random and Sequential Data
Comments: 12 pages, 4 figures, 3 tables
Subjects: Systems and Control (eess.SY)

Many multivariate statistical analysis methods and their corresponding probabilistic counterparts have been adopted to develop process monitoring models in recent decades. However, the connections between them have rarely been studied. In this study, a generalized probabilistic monitoring model (GPMM) is developed with both random and sequential data. Since GPMM can be reduced to various probabilistic linear models under specific restrictions, it is adopted to analyze the connections between different monitoring methods. Using the expectation-maximization (EM) algorithm, the parameters of GPMM are estimated for both random and sequential cases. Based on the obtained model parameters, statistics are designed for monitoring different aspects of the process system. In addition, the distributions of these statistics are rigorously derived and proved, so that the control limits can be calculated accordingly. After that, contribution analysis methods are presented for identifying faulty variables once process anomalies are detected. Finally, the equivalence between monitoring models based on classical multivariate methods and their corresponding probabilistic graphical models is further investigated. The conclusions of this study are verified using a numerical example and the Tennessee Eastman (TE) process. Experimental results illustrate that the proposed monitoring statistics follow their corresponding distributions, and that they are equivalent to statistics in classical deterministic models under specific restrictions.

[60]  arXiv:2206.13443 [pdf, other]
Title: CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer
Comments: Accepted to be published in the Proceedings of InterSpeech 2022
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at a fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel approach to two-stage training. In Stage I, the model learns speaker-independent word-level prosody representations from speech, which it uses for many-to-many fine-grained prosody transfer. In Stage II, we learn to predict these prosody representations using the contextual information available in text, thereby enabling multi-speaker TTS with contextually appropriate prosody. We compare CC2 to two strong baselines, one in TTS with contextually appropriate prosody, and one in fine-grained prosody transfer. CC2 reduces the gap in naturalness between our baseline and copy-synthesised speech by $22.79\%$. In fine-grained prosody transfer evaluations, it obtains a relative improvement of $33.15\%$ in target speaker similarity.

[61]  arXiv:2206.13455 [pdf, other]
Title: IBISCape: A Simulated Benchmark for multi-modal SLAM Systems Evaluation in Large-scale Dynamic Environments
Comments: Submitted to the Journal of Intelligent & Robotic Systems (JINT - Special Issue)
Subjects: Image and Video Processing (eess.IV); Robotics (cs.RO)

The development process of high-fidelity SLAM systems depends on their validation upon reliable datasets. Towards this goal, we propose IBISCape, a simulated benchmark that includes data synchronization and acquisition APIs for telemetry from heterogeneous sensors: stereo-RGB/DVS, Depth, IMU, and GPS, along with the ground truth scene segmentation and vehicle ego-motion. Our benchmark is built upon the CARLA simulator, whose back-end is the Unreal Engine, rendering highly dynamic scenery that simulates the real world. Moreover, we offer 34 multi-modal datasets suitable for autonomous vehicle navigation, including scenarios for scene understanding evaluation such as accidents, along with a wide range of frame quality based on a dynamic weather simulation class integrated with our APIs. We also introduce the first calibration targets to CARLA maps to solve the unknown distortion parameters problem of CARLA simulated DVS and RGB cameras. Finally, using IBISCape sequences, we evaluate the performance of four ORB-SLAM3 systems (monocular RGB, stereo RGB, Stereo Visual Inertial (SVI), and RGB-D) and the BASALT Visual-Inertial Odometry (VIO) system on various sequences collected in simulated large-scale dynamic environments.
Keywords: benchmark, multi-modal, datasets, Odometry, Calibration, DVS, SLAM

[62]  arXiv:2206.13483 [pdf, other]
Title: Optimized Decoding-Energy-Aware Encoding in Practical VVC Implementations
Subjects: Image and Video Processing (eess.IV)

The optimization of the energy demand is crucial for modern video codecs. Previous studies show that the energy demand of VVC decoders can be reduced by more than 50% if specific coding tools are disabled in the encoder. However, those approaches increase the bit rate by over 20% if the concept is applied to practical encoder implementations such as VVenC. Therefore, in this work, we investigate VVenC and study possibilities to reduce the additional bit rate, while still achieving low-energy decoding at reasonable encoding times. We show that, when encoding with our proposed coding tool profiles, the decoding energy efficiency is improved by over 25% with a bit rate increase of less than 5% with respect to standard encoding. Furthermore, we propose a second coding tool profile targeting maximum energy savings, which achieves energy savings of 34% at bit rate increases below 15%.

Cross-lists for Tue, 28 Jun 22

[63]  arXiv:2206.12420 (cross-list from cs.LG) [pdf, other]
Title: SCAI: A Spectral data Classification framework with Adaptive Inference for the IoT platform
Comments: 14 pages, 11 figures
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)

Accurate, efficient, real-time identification of massive spectral data with the help of deep learning and IoT technology is currently an active research topic. Deep neural networks have played a key role in spectral analysis. However, inference in deeper models is performed in a static manner and cannot be adjusted to the device. Not all samples require the full computation to reach a confident prediction, so static inference hinders maximizing the overall performance. To address the above issues, we propose a Spectral data Classification framework with Adaptive Inference. Specifically, to allocate different computations for different samples while better exploiting the collaboration among different devices, we leverage an Early-exit architecture and place intermediate classifiers at different depths of the architecture; the model outputs its result when the prediction confidence reaches a preset threshold. We propose a self-distillation training paradigm in which the deepest classifier performs soft supervision on the shallow ones to maximize their performance and training speed. At the same time, to mitigate the vulnerability of performance to the location and number settings of intermediate classifiers in the Early-exit paradigm, we propose a Position-Adaptive residual network. It can adjust the number of layers in each block at different curve positions, so it can focus on important positions of the curve (e.g., the Raman peak) and accurately allocate the appropriate computational budget based on task performance and computing resources. To the best of our knowledge, this paper is the first attempt to conduct optimization by adaptive inference for spectral detection under the IoT platform. We conducted extensive experiments, and the results show that our proposed method can achieve higher performance with a smaller computational budget than existing methods.
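
The early-exit rule described above, where the model returns as soon as an intermediate classifier is confident enough, can be sketched as follows. This is a toy illustration rather than the SCAI implementation: the function `early_exit_predict`, the two stand-in classifiers, and the use of the maximum softmax probability as the confidence score are assumptions made here. In a real network each exit would reuse the features computed by the preceding blocks instead of reprocessing the input.

```python
import math
from typing import Callable, Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(x: float,
                       exits: Sequence[Callable[[float], Sequence[float]]],
                       threshold: float = 0.9) -> tuple[int, int, float]:
    """Run exits from shallow to deep; stop at the first whose top softmax
    probability reaches `threshold`, or at the last exit otherwise.
    Returns (predicted_class, exit_index, confidence)."""
    if not exits:
        raise ValueError("exits must not be empty")
    for i, classifier in enumerate(exits):
        probs = softmax(classifier(x))
        confidence = max(probs)
        if confidence >= threshold or i == len(exits) - 1:
            return probs.index(confidence), i, confidence
    raise AssertionError("unreachable")


# Hypothetical stand-ins for a shallow and a deep classifier head.
def shallow_exit(x: float) -> list[float]:
    return [x, 0.0]


def deep_exit(x: float) -> list[float]:
    return [4.0 * x, 0.0]


easy = early_exit_predict(3.0, [shallow_exit, deep_exit])  # leaves at exit 0
hard = early_exit_predict(0.5, [shallow_exit, deep_exit])  # falls through to exit 1
print(easy)
print(hard)
```

Raising the threshold sends more samples to deeper exits, trading computation for confidence; lowering it does the opposite, which is the knob an adaptive-inference system can tune per device.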

[64]  arXiv:2206.12469 (cross-list from cs.SD) [pdf, other]
Title: Burst2Vec: An Adversarial Multi-Task Approach for Predicting Emotion, Age, and Origin from Vocal Bursts
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

We present Burst2Vec, our multi-task learning approach to predict emotion, age, and origin (i.e., native country/language) from vocal bursts. Burst2Vec utilises pre-trained speech representations to capture acoustic information from raw waveforms and incorporates the concept of model debiasing via adversarial training. Our models achieve a relative 30% performance gain over baselines using pre-extracted features and score the highest amongst all participants in the ICML ExVo 2022 Multi-Task Challenge.

[65]  arXiv:2206.12471 (cross-list from cs.RO) [pdf, other]
Title: Interaction-Dynamics-Aware Perception Zones for Obstacle Detection Safety Evaluation
Comments: Accepted to Intelligent Vehicles Symposium 2022
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)

To enable safe autonomous vehicle (AV) operations, it is critical that an AV's obstacle detection module can reliably detect obstacles that pose a safety threat (i.e., are safety-critical). It is therefore desirable that the evaluation metric for the perception system captures the safety-criticality of objects. Unfortunately, existing perception evaluation metrics tend to make strong assumptions about the objects and ignore the dynamic interactions between agents, and thus do not accurately capture the safety risks in reality. To address these shortcomings, we introduce an interaction-dynamics-aware obstacle detection evaluation metric by accounting for closed-loop dynamic interactions between an ego vehicle and obstacles in the scene. By borrowing existing theory from optimal control theory, namely Hamilton-Jacobi reachability, we present a computationally tractable method for constructing a "safety zone": a region in state space that defines where safety-critical obstacles lie for the purpose of defining safety metrics. Our proposed safety zone is mathematically complete, and can be easily computed to reflect a variety of safety requirements. Using an off-the-shelf detection algorithm from the nuScenes detection challenge leaderboard, we demonstrate that our approach is computationally lightweight, and can better capture safety-critical perception errors than a baseline approach.

[66]  arXiv:2206.12480 (cross-list from cs.CV) [pdf, other]
Title: Attention-Guided Autoencoder for Automated Progression Prediction of Subjective Cognitive Decline with Structural MRI
Comments: 10 pages, 9 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Subjective cognitive decline (SCD) is a preclinical stage of Alzheimer's disease (AD) which occurs even before mild cognitive impairment (MCI). Progressive SCD will convert to MCI with the potential of further evolving to AD. Therefore, early identification of progressive SCD with neuroimaging techniques (e.g., structural MRI) is of great clinical value for early intervention of AD. However, existing MRI-based machine/deep learning methods usually suffer from the small-sample-size problem, which poses a great challenge to related neuroimaging analysis. The central question we aim to tackle in this paper is how to leverage related domains (e.g., AD/NC) to assist the progression prediction of SCD. Meanwhile, we are concerned about which brain areas are more closely linked to the identification of progressive SCD. To this end, we propose an attention-guided autoencoder model for efficient cross-domain adaptation which facilitates the knowledge transfer from AD to SCD. The proposed model is composed of four key components: 1) a feature encoding module for learning shared subspace representations of different domains, 2) an attention module for automatically locating discriminative brain regions of interest defined in brain atlases, 3) a decoding module for reconstructing the original input, 4) a classification module for identification of brain diseases. Through joint training of these four modules, domain-invariant features can be learned. Meanwhile, the brain-disease-related regions can be highlighted by the attention mechanism. Extensive experiments on the publicly available ADNI dataset and a private CLAS dataset have demonstrated the effectiveness of the proposed method. The proposed model is straightforward to train and test with only 5-10 seconds on CPUs and is suitable for medical tasks with small datasets.

[67]  arXiv:2206.12484 (cross-list from cs.LG) [pdf, other]
Title: A Novel Approach For Analysis of Distributed Acoustic Sensing System Based on Deep Transfer Learning
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Distributed acoustic sensors (DAS) are effective apparatus which are widely used in many application areas for recording signals of various events with very high spatial resolution along the optical fiber. To detect and recognize the recorded events properly, advanced signal processing algorithms with high computational demands are crucial. Convolutional neural networks (CNNs) are highly capable tools for extracting spatial information and very suitable for event recognition applications in DAS. Long short-term memory (LSTM) is an effective instrument for processing sequential data. In this study, we proposed a multi-input multi-output, two-stage feature extraction methodology that combines the capabilities of these neural network architectures with transfer learning to classify vibrations applied to an optical fiber by a piezo transducer. First, we extracted the differential amplitude and phase information from the Phase-OTDR recordings and stored them in a temporal-spatial data matrix. Then, we used a state-of-the-art pre-trained CNN without dense layers as a feature extractor in the first stage. In the second stage, we used LSTMs to further analyze the features extracted by the CNN. Finally, we used a dense layer to classify the extracted features. To observe the effect of the utilized CNN architecture, we tested our model with five state-of-the-art pre-trained models (VGG-16, ResNet-50, DenseNet-121, MobileNet and Inception-v3). The results show that using the VGG-16 architecture in our framework achieved 100% classification accuracy in 50 trainings and gave the best results on our Phase-OTDR dataset. The outcomes of this study indicate that pre-trained CNNs combined with LSTMs are very suitable for the analysis of differential amplitude and phase information represented in a temporal-spatial data matrix, which is promising for event recognition operations in DAS applications.

[68]  arXiv:2206.12494 (cross-list from cs.SD) [pdf, other]
Title: Multitask vocal burst modeling with ResNets and pre-trained paralinguistic Conformers
Comments: To be published in the ICML Expressive Vocalizations Workshop & Competition 2022 (this https URL)
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

This technical report presents the modeling approaches used in our submission to the ICML Expressive Vocalizations Workshop & Competition multitask track (ExVo-MultiTask). We first applied image classification models of various sizes on mel-spectrogram representations of the vocal bursts, as is standard in sound event detection literature. Results from these models show an increase of 21.24% over the baseline system with respect to the harmonic mean of the task metrics, and comprise our team's main submission to the MultiTask track. We then sought to characterize the headroom in the MultiTask track by applying a large pre-trained Conformer model that previously achieved state-of-the-art results on paralinguistic tasks like speech emotion recognition and mask detection. We additionally investigated the relationship between the sub-tasks of emotional expression, country of origin, and age prediction, and discovered that the best performing models are trained as single-task models, questioning whether the problem truly benefits from a multitask setting.

[69]  arXiv:2206.12507 (cross-list from physics.plasm-ph) [pdf, other]
Title: Electromagnetic Non-Reciprocity in a Magnetized Plasma Circulator
Subjects: Plasma Physics (physics.plasm-ph); Signal Processing (eess.SP)

Non-reciprocal transport of electromagnetic waves within magnetized plasma is a powerful building block towards understanding and exploiting the properties of more general topological systems. Much recent attention has been paid to the theoretical issues of wave interaction within such a medium, but there is a lack of experimental verification that such systems can be viable in a lab or industrial setting. This work provides an experimental proof-of-concept by demonstrating non-reciprocity in a unit component, a microwave plasma circulator. We design an E-plane Y junction plasma circulator operating in the range of 4 to 6 GHz using standardized waveguide specifications. From both simulations and experiments, we observe wide band isolation for the power transmission through the circulator. The performance and the frequency band of the circulator can be easily tuned by changing the plasma density and the magnetic field strength. By linking simulations and experimental results, we estimate the plasma density for the device.

[70]  arXiv:2206.12513 (cross-list from cs.SD) [pdf, other]
Title: Domain Generalization with Relaxed Instance Frequency-wise Normalization for Multi-device Acoustic Scene Classification
Comments: Proceedings of INTERSPEECH 2022
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

When two-dimensional convolutional neural networks (2D-CNNs) are used in image processing, it is possible to manipulate domain information using channel statistics, and instance normalization has been a promising way to get domain-invariant features. However, our analysis shows that, unlike in image processing, domain-relevant information in an audio feature is dominant in frequency statistics rather than channel statistics. Motivated by our analysis, we introduce Relaxed Instance Frequency-wise Normalization (RFN): a plug-and-play, explicit normalization module along the frequency axis which can eliminate instance-specific domain discrepancy in an audio feature while relaxing undesirable loss of useful discriminative information. Empirically, simply adding RFN to networks shows clear margins compared to previous domain generalization approaches on acoustic scene classification and yields improved robustness for multiple audio devices. In particular, the proposed RFN won the DCASE2021 challenge TASK1A, low-complexity acoustic scene classification with multiple devices, with a clear margin; this paper extends our technical report.
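
The abstract does not spell out how the "relaxation" is computed. The sketch below shows one plausible reading, offered purely as an assumption: the output is a convex combination, weighted by a hypothetical parameter `lam`, of a normalization over the whole instance and a normalization computed separately per frequency bin. For brevity the instance is a single-channel frequency-by-time matrix, whereas a real feature map would also carry batch and channel axes.

```python
import math


def _normalize(values, eps=1e-5):
    """Zero-mean, unit-variance scaling of a flat list of floats."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = math.sqrt(var + eps)
    return [(v - mean) / scale for v in values]


def relaxed_freq_norm(feature, lam=0.5):
    """Blend global and frequency-wise normalization of one instance.

    feature: list of frequency rows, each a list of values over time.
    Assumed form: lam * global_norm(x) + (1 - lam) * freq_wise_norm(x).
    """
    n_time = len(feature[0])
    # Frequency-wise instance normalization: statistics per frequency bin.
    freq_wise = [_normalize(row) for row in feature]
    # Global normalization: statistics over every bin of the instance.
    flat = _normalize([v for row in feature for v in row])
    global_norm = [flat[i * n_time:(i + 1) * n_time] for i in range(len(feature))]
    return [
        [lam * g + (1.0 - lam) * f for g, f in zip(g_row, f_row)]
        for g_row, f_row in zip(global_norm, freq_wise)
    ]


# Two frequency bins with very different scales (hypothetical values).
spec = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
fully_freq_wise = relaxed_freq_norm(spec, lam=0.0)
fully_global = relaxed_freq_norm(spec, lam=1.0)
```

With `lam=0.0` both bins are mapped to nearly the same values, so per-frequency scale differences are removed entirely; with `lam=1.0` the scale difference between the bins is preserved.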

[71]  arXiv:2206.12523 (cross-list from cs.IT) [pdf, ps, other]
Title: MMSE Symbol Level Precoding Under a Per Antenna Power Constraint for Multiuser MIMO Systems With PSK Modulation
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

This study proposes a symbol-level precoding algorithm based on the minimum mean squared error design objective under a strict per antenna power constraint for PSK modulation. The proposed design is then formulated in the standard form of a second-order cone program, allowing for an optimal solution via the interior point method. Numerical results indicate that the proposed design is superior to the existing approaches in terms of bit-error-rate for the low and intermediate SNR regime.

[72]  arXiv:2206.12559 (cross-list from cs.SD) [pdf, other]
Title: Self-supervised Context-aware Style Representation for Expressive Speech Synthesis
Comments: Accepted by Interspeech 2022
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Expressive speech synthesis, like audiobook synthesis, is still challenging for style representation learning and prediction. Deriving style from reference audio or predicting style tags from text requires a huge amount of labeled data, which is costly to acquire and difficult to define and annotate accurately. In this paper, we propose a novel framework for learning style representation from abundant plain text in a self-supervised manner. It leverages an emotion lexicon and uses contrastive learning and deep clustering. We further integrate the style representation as a conditioned embedding in a multi-style Transformer TTS. Compared with a multi-style TTS that predicts style tags and is trained on the same dataset but with human annotations, our method achieves improved results according to subjective evaluations on both in-domain and out-of-domain test sets in audiobook speech. Moreover, with implicit context-aware style representation, the emotion transition of synthesized audio in a long paragraph appears more natural. The audio samples are available on the demo web page.

[73]  arXiv:2206.12563 (cross-list from cs.SD) [pdf, other]
Title: Generating Diverse Vocal Bursts with StyleGAN2 and MEL-Spectrograms
Comments: To be published at the ICML Expressive Vocalizations Workshop and Competition (ExVo Generate) held in conjunction with the 39th International Conference on Machine Learning
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

We describe our approach for the generative emotional vocal burst task (ExVo Generate) of the ICML Expressive Vocalizations Competition. We train a conditional StyleGAN2 architecture on mel-spectrograms of preprocessed versions of the audio samples. The mel-spectrograms generated by the model are then inverted back to the audio domain. As a result, our generated samples substantially improve upon the baseline provided by the competition from a qualitative and quantitative perspective for all emotions. More precisely, even for our worst-performing emotion (awe), we obtain an FAD of 1.76 compared to the baseline of 4.81 (as a reference, the FAD between the train/validation sets for awe is 0.776).

[74]  arXiv:2206.12568 (cross-list from cs.SD) [pdf, other]
Title: Self-supervision and Learnable STRFs for Age, Emotion, and Country Prediction
Journal-ref: Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

This work presents a multitask approach to the simultaneous estimation of age, country of origin, and emotion given vocal burst audio for the 2022 ICML Expressive Vocalizations Challenge ExVo-MultiTask track. The method of choice utilized a combination of spectro-temporal modulation and self-supervised features, followed by an encoder-decoder network organized in a multitask paradigm. We evaluate the complementarity between the tasks posed by examining independent task-specific and joint models, and explore the relative strengths of different feature sets. We also introduce a simple score fusion mechanism to leverage the complementarity of different feature sets for this task. We find that robust data preprocessing in conjunction with score fusion over spectro-temporal receptive field and HuBERT models achieved our best ExVo-MultiTask test score of 0.412.

[75]  arXiv:2206.12638 (cross-list from cs.CL) [pdf, other]
Title: Distilling a Pretrained Language Model to a Multilingual ASR Model
Comments: Accepted to Interspeech 2022. Official implementation provided in this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Multilingual speech data often suffer from long-tailed language distribution, resulting in performance degradation. However, multilingual text data is much easier to obtain, yielding a more useful general language model. Hence, we are motivated to distill the rich knowledge embedded inside a well-trained teacher text model to the student speech model. We propose a novel method called the Distilling a Language model to a Speech model (Distill-L2S), which aligns the latent representations of two different modalities. The subtle differences are handled by the shrinking mechanism, nearest-neighbor interpolation, and a learnable linear projection layer. We demonstrate the effectiveness of our distillation method by applying it to the multilingual automatic speech recognition (ASR) task. We distill the transformer-based cross-lingual language model (InfoXLM) while fine-tuning the large-scale multilingual ASR model (XLSR-wav2vec 2.0) for each language. We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.

[76]  arXiv:2206.12662 (cross-list from cs.SD) [pdf, other]
Title: Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations
Authors: Chin-Cheng Hsu
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

We formulated non-speech vocalization (NSV) modeling as a text-to-speech task and verified its viability. Specifically, we evaluated the phonetic expressivity of HuBERT speech units on NSVs and verified our model's ability to control speaker timbre even though the training data is speaker few-shot. In addition, we substantiated that the heterogeneity in recording conditions is the major obstacle for NSV modeling. Finally, we discussed five possible improvements to our method for future research. Audio samples of synthesized NSVs are available on our demo page: https://resemble-ai.github.io/reLaugh.

[77]  arXiv:2206.12693 (cross-list from cs.CL) [pdf, other]
Title: TEVR: Improving Speech Recognition by Token Entropy Variance Reduction
Comments: 10 pages including 2 pages appendix, 1 figure, 6 tables
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

This paper presents TEVR, a speech recognition model designed to minimize the variation in token entropy w.r.t. the language model. This takes advantage of the fact that if the language model will reliably and accurately predict a token anyway, then the acoustic model doesn't need to be accurate in recognizing it. We train German ASR models with 900 million parameters and show that on CommonVoice German, TEVR scores a very competitive 3.64% word error rate, which outperforms the best reported results by a relative 16.89% reduction in word error rate. We hope that releasing our fully trained speech recognition pipeline to the community will lead to privacy-preserving offline virtual assistants in the future.
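
As a reading aid for the figures in this abstract, the short sketch below shows how a relative reduction relates to absolute word error rates. It assumes the 16.89% is measured against the previously best reported WER, which the abstract implies but does not state numerically; under that assumption the implied earlier result is roughly 4.38%. The function names are illustrative only.

```python
def relative_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer, reduction):
    """Baseline WER implied by a new WER and a relative reduction."""
    return new_wer / (1.0 - reduction)


# Figures from the abstract: 3.64% WER, 16.89% relative reduction.
baseline = implied_baseline(3.64, 0.1689)
check = relative_reduction(baseline, 3.64)
```

Note the difference between relative and absolute terms: the absolute gap here is well under one percentage point, even though the relative reduction is close to 17%.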

[78]  arXiv:2206.12759 (cross-list from cs.CL) [pdf, other]
Title: Low-resource Accent Classification in Geographically-proximate Settings: A Forensic and Sociophonetics Perspective
Comments: INTERSPEECH 2022
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Accented speech recognition and accent classification are relatively under-explored research areas in speech technology. Recently, deep learning-based methods and Transformer-based pretrained models have achieved superb performances in both areas. However, most accent classification tasks focused on classifying different kinds of English accents, and little attention was paid to geographically-proximate accent classification, especially under the low-resource settings that forensic speech science tasks usually encounter. In this paper, we explored three main accent modelling methods combined with two different classifiers based on 105 speaker recordings retrieved from five urban varieties in Northern England. Although speech representations generated from pretrained models generally have better performances in downstream classification, traditional methods like Mel Frequency Cepstral Coefficients (MFCCs) and formant measurements have their own specific strengths. These results suggest that in forensic phonetics scenarios where data are relatively scarce, a simple modelling method and classifier could be competitive with state-of-the-art pretrained speech models as feature extractors, which could enable faster estimation of accent information in practice. Besides, our findings also cross-validated a new methodology for quantifying sociophonetic changes.

[79]  arXiv:2206.12772 (cross-list from cs.CV) [pdf, other]
Title: Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation
Comments: 10 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)

We present a simple yet effective self-supervised framework for audio-visual representation learning, to localize the sound source in videos. To understand what enables the learning of useful representations, we systematically investigate the effects of data augmentations, and reveal that (1) composition of data augmentations plays a critical role, i.e. explicitly encouraging the audio-visual representations to be invariant to various transformations (transformation invariance); (2) enforcing geometric consistency substantially improves the quality of learned representations, i.e. the detected sound source should follow the same transformation applied on input video frames (transformation equivariance). Extensive experiments demonstrate that our model significantly outperforms previous methods on two sound localization benchmarks, namely, Flickr-SoundNet and VGG-Sound. Additionally, we also evaluate audio retrieval and cross-modal retrieval tasks. In both cases, our self-supervised models demonstrate superior retrieval performances, even competitive with the supervised approach in audio retrieval. This reveals that the proposed framework learns strong multi-modal representations that are beneficial to sound localisation and generalization to further applications. All codes will be available.

[80]  arXiv:2206.12822 (cross-list from cs.IT) [pdf, other]
Title: Channel Estimation and Signal Detection for MIMO-AFDM under Doubly Selective Channels
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

On the heels of orthogonal time frequency space (OTFS) modulation, the recently discovered affine frequency division multiplexing (AFDM) is a promising waveform for the sixth-generation wireless network due to its strong delay-Doppler resilience against doubly dispersive channels. To exploit the high multiplexing and diversity gains of multiple-input multiple-output (MIMO), we derive a vectorized input-output formulation of the MIMO-AFDM system. Correspondingly, we also propose an efficient single pilot aided with minimum guard (SPA-MG) scheme to perform channel estimation in the discrete affine Fourier transform (DAFT) domain. Furthermore, a message-passing-based iterative detector is explored for signal detection. Finally, the bit error ratio (BER) performances are simulated under doubly selective channels. It is worth emphasizing that the MIMO-AFDM system can achieve outstanding performance similar to MIMO-OTFS. Additionally, compared to ideal channel state information, our proposed SPA-MG scheme is verified to incur only a marginal performance difference with the least overhead.

[81]  arXiv:2206.12829 (cross-list from cs.SD) [pdf, other]
Title: On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode
Comments: Accepted at SPCOM 2022
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Streaming automatic speech recognition (ASR) models are more popular and suitable for voice-based applications. However, non-streaming models provide better performance as they look at the entire audio context. To leverage the benefits of the non-streaming model in streaming applications like voice search, it is commonly used in second-pass re-scoring mode. The candidate hypotheses generated using streaming models are re-scored using a non-streaming model. In this work, we evaluate the non-streaming attention-based end-to-end ASR models on the Flipkart voice search task in both standalone and re-scoring modes. These models are based on the Listen-Attend-Spell (LAS) encoder-decoder architecture. We experiment with different encoder variations based on LSTM, Transformer, and Conformer. We compare the latency requirements of these models along with their performance. Overall we show that the Transformer model offers acceptable word error rates (WER) with the lowest latency requirements. We report a relative WER improvement of around 16% with the second-pass LAS re-scoring with latency overhead under 5ms. We also highlight the importance of a CNN front-end with the Transformer architecture to achieve comparable WER. Moreover, we observe that in the second-pass re-scoring mode all the encoders provide similar benefits, whereas the difference in performance is prominent in standalone text generation mode.

[82]  arXiv:2206.12879 (cross-list from cs.CL) [pdf, ps, other]
Title: Data Augmentation for Dementia Detection in Spoken Language
Comments: Accepted to INTERSPEECH 2022
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Dementia is a growing problem as our society ages, and detection methods are often invasive and expensive. Recent deep-learning techniques can offer a faster diagnosis and have shown promising results. However, they require large amounts of labelled data which is not easily available for the task of dementia detection. One effective solution to sparse data problems is data augmentation, though the exact methods need to be selected carefully. To date, there has been no empirical study of data augmentation on Alzheimer's disease (AD) datasets for NLP and speech processing. In this work, we investigate data augmentation techniques for the task of AD detection and perform an empirical evaluation of the different approaches on two kinds of models for both the text and audio domains. We use a transformer-based model for both domains, and SVM and Random Forest models for the text and audio domains, respectively. We generate additional samples using traditional as well as deep learning based methods and show that data augmentation improves performance for both the text- and audio-based models and that such results are comparable to state-of-the-art results on the popular ADReSS set, with carefully crafted architectures and features.

[83]  arXiv:2206.12914 (cross-list from cs.CV) [pdf, other]
Title: Video Anomaly Detection via Prediction Network with Enhanced Spatio-Temporal Memory Exchange
Comments: Accepted at ICASSP 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Video anomaly detection is a challenging task because most anomalies are scarce and non-deterministic. Many approaches investigate the reconstruction difference between normal and abnormal patterns, but neglect that anomalies do not necessarily correspond to large reconstruction errors. To address this issue, we design a Convolutional LSTM Auto-Encoder prediction framework with enhanced spatio-temporal memory exchange using bi-directionality and a higher-order mechanism. The bi-directional structure promotes learning the temporal regularity through forward and backward predictions. The unique higher-order mechanism further strengthens spatial information interaction between the encoder and the decoder. Considering the limited receptive fields in Convolutional LSTMs, we also introduce an attention module to highlight informative features for prediction. Anomalies are eventually identified by comparing the frames with their corresponding predictions. Evaluations on three popular benchmarks show that our framework outperforms most existing prediction-based anomaly detection methods.
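
The final step in this abstract, identifying anomalies by comparing frames with their predictions, can be illustrated with a small sketch. The abstract does not say which comparison is used, so the scoring rule here is an assumption: PSNR between each frame and its prediction, min-max normalized within a video and inverted so that higher means more anomalous, a convention often seen in prediction-based methods. Frames are flat lists of pixel values in [0, 1], and all names are illustrative.

```python
import math


def psnr(frame, prediction, peak=1.0):
    """Peak signal-to-noise ratio between two equally sized flat frames."""
    mse = sum((a - b) ** 2 for a, b in zip(frame, prediction)) / len(frame)
    mse += 1e-12  # keeps the logarithm finite for perfect predictions
    return 10.0 * math.log10(peak ** 2 / mse)


def anomaly_scores(frames, predictions):
    """Per-frame score in [0, 1]; higher means a worse prediction."""
    values = [psnr(f, p) for f, p in zip(frames, predictions)]
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)
    return [1.0 - (v - low) / (high - low) for v in values]


# Three tiny two-pixel "frames"; only the last is badly predicted.
frames = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
predictions = [[0.5, 0.6], [0.5, 0.52], [0.5, 0.5]]
scores = anomaly_scores(frames, predictions)
most_anomalous = scores.index(max(scores))
```

Because the scores are normalized per video, they rank frames within one video but are not directly comparable across videos.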

[84]  arXiv:2206.12928 (cross-list from cs.LG) [pdf, other]
Title: Learning neural state-space models: do we need a state estimator?
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

In recent years, several algorithms for system identification with neural state-space models have been introduced. Most of the proposed approaches are aimed at reducing the computational complexity of the learning problem, by splitting the optimization over short sub-sequences extracted from a longer training dataset. Different sequences are then processed simultaneously within a minibatch, taking advantage of modern parallel hardware for deep learning. An issue arising in these methods is the need to assign an initial state for each of the sub-sequences, which is required to run simulations and thus to evaluate the fitting loss. In this paper, we provide insights for calibration of neural state-space training algorithms based on extensive experimentation and analyses performed on two recognized system identification benchmarks. Particular focus is given to the choice and the role of the initial state estimation. We demonstrate that advanced initial state estimation techniques are really required to achieve high performance on certain classes of dynamical systems, while for asymptotically stable ones basic procedures such as zero or random initialization already yield competitive performance.

[85]  arXiv:2206.12930 (cross-list from cs.CV) [pdf, other]
Title: SVBR-NET: A Non-Blind Spatially Varying Defocus Blur Removal Network
Comments: Accepted to ICIP2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Defocus blur is a physical consequence of the optical sensors used in most cameras. Although it can be used as a photographic style, it is commonly viewed as an image degradation modeled as the convolution of a sharp image with a spatially-varying blur kernel. Motivated by the advance of blur estimation methods in the past years, we propose a non-blind approach for image deblurring that can deal with spatially-varying kernels. We introduce two encoder-decoder sub-networks that are fed with the blurry image and the estimated blur map, respectively, and produce as output the deblurred (deconvolved) image. Each sub-network presents several skip connections that allow data propagation from layers spread apart, and also inter-subnetwork skip connections that ease the communication between the modules. The network is trained with synthetic blur kernels that are augmented to emulate blur maps produced by existing blur estimation methods, and our experimental results show that our method works well when combined with a variety of blur estimation methods.

[86]  arXiv:2206.12931 (cross-list from cs.CL) [pdf, ps, other]
Title: Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi
Comments: Speech for Social Good Workshop, 2022, Interspeech 2022
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

In this paper we discuss in-progress work on the development of a speech corpus for four low-resource Indo-Aryan languages -- Awadhi, Bhojpuri, Braj and Magahi -- using the field methods of linguistic data collection. The total size of the corpus currently stands at approximately 18 hours (approx. 4-5 hours for each language) and it is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal dependency relationships. We discuss our methodology for data collection in these languages, most of which was done in the middle of the COVID-19 pandemic, with one of the aims being to generate some additional income for low-income groups speaking these languages. In the paper, we also discuss the results of the baseline experiments for automatic speech recognition systems in these languages.

[87]  arXiv:2206.12947 (cross-list from cs.HC) [pdf, other]
Title: Improved Processing of Ultrasound Tongue Videos by Combining ConvLSTM and 3D Convolutional Networks
Comments: 10 pages, 4 tables, 2 figures, conference
Subjects: Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)

Silent Speech Interfaces aim to reconstruct the acoustic signal from a sequence of ultrasound tongue images that records the articulatory movement. The extraction of information about the tongue movement requires us to efficiently process the whole sequence of images, not just a single image. Several approaches have been suggested to process such sequential image data. The classic neural network structure combines two-dimensional convolutional (2D-CNN) layers that process the images separately with recurrent layers (e.g., an LSTM) on top of them to fuse the information along time. More recently, it was shown that one may also apply a 3D-CNN network that can extract information along both the spatial and the temporal axes in parallel, achieving a similar accuracy while being less time consuming. A third option is to apply the less well-known ConvLSTM layer type, which combines the advantages of LSTM and CNN layers by replacing matrix multiplication with the convolution operation. In this paper, we experimentally compared various combinations of the above-mentioned layer types for a silent speech interface task, and we obtained the best result with a hybrid model that consists of a combination of 3D-CNN and ConvLSTM layers. This hybrid network is slightly faster, smaller and more accurate than our previous 3D-CNN model.

[88]  arXiv:2206.12955 (cross-list from cs.CL) [pdf, other]
Title: Improving the Training Recipe for a Robust Conformer-based Hybrid Model
Comments: Accepted at INTERSPEECH 2022
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)

Speaker adaptation is important to build robust automatic speech recognition (ASR) systems. In this work, we investigate various methods for speaker adaptive training (SAT) based on feature-space approaches for a conformer-based acoustic model (AM) on the Switchboard 300h dataset. We propose a method, called Weighted-Simple-Add, which adds weighted speaker information vectors to the input of the multi-head self-attention module of the conformer AM. Using this method for SAT, we achieve 3.5% and 4.5% relative improvement in terms of WER on the CallHome part of Hub5'00 and Hub5'01 respectively. Moreover, we build on top of our previous work where we proposed a novel and competitive training recipe for a conformer-based hybrid AM. We extend and improve this recipe, achieving 11% relative improvement in terms of word-error-rate (WER) on the Switchboard 300h Hub5'00 dataset. We also make this recipe efficient by reducing the total number of parameters by 34% relative.
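
The relative improvements quoted in this abstract are ratios against a baseline WER rather than absolute differences. The following is a minimal sketch of that arithmetic; the function name and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (not results from the paper):
# going from 10.0% WER to 8.9% WER is an absolute gain of 1.1 points,
# which corresponds to a relative reduction of about 11%.
example = relative_wer_reduction(10.0, 8.9)
```

A relative figure is only meaningful alongside the baseline it refers to, since the same relative gain corresponds to different absolute gains on different test sets.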

[89]  arXiv:2206.13021 (cross-list from cs.SD) [pdf, other]
Title: Speak Like a Professional: Increasing Speech Intelligibility by Mimicking Professional Announcer Voice with Voice Conversion
Comments: Accepted at INTERSPEECH 2022
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

In most practical scenarios, the announcement system must deliver speech messages in a noisy environment, in which the background noise cannot be cancelled out. The local noise reduces speech intelligibility and increases the listening effort of the listener, hence hampering the effectiveness of the announcement system. It has been reported that voices of professional announcers are clearer and more comprehensible than those of non-expert speakers in noisy environments. This finding suggests that speech intelligibility might be related to the speaking style of professional announcers, which can be adapted using a voice conversion method. Motivated by this idea, this paper proposes a speech intelligibility enhancement in noisy environments by applying a voice conversion method to non-professional voices. We discovered that professional announcers and non-professional speakers fall into different clusters on the speaker embedding plane. This implies that speech intelligibility can be controlled as an independent feature of speaker individuality. To examine the advantage of the converted voice in noisy environments, we experimented using test words masked in pink noise at different SNR levels. The results of objective and subjective evaluations confirm that the speech intelligibility of the converted voice is higher than that of the original voice in low SNR conditions.

[90]  arXiv:2206.13042 (cross-list from cs.CV) [pdf, other]
Title: A Strategy Optimized Pix2pix Approach for SAR-to-Optical Image Translation Task
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

This paper presented a state-of-the-art framework, Time Gated Convolutional Neural Network (TGCNN) that takes advantage of temporal information and gating mechanisms for the crop classification problem. Besides, several vegetation indices were constructed to expand dimensions of input data to take advantage of spectral information. Both spatial (channel-wise) and temporal (step-wise) correlation are considered in TGCNN. Specifically, our preliminary analysis indicates that step-wise information is of greater importance in this data set. Lastly, the gating mechanism helps capture high-order relationship. Our TGCNN solution achieves $0.973$ F1 score, $0.977$ AUC ROC and $0.948$ IoU, respectively. In addition, it outperforms three other benchmarks in different local tasks (Kenya, Brazil and Togo). Overall, our experiments demonstrate that TGCNN is advantageous in this earth observation time series classification task.

[91]  arXiv:2206.13071 (cross-list from cs.SD) [pdf, other]
Title: Uncertainty Calibration for Deep Audio Classifiers
Comments: Accepted by InterSpeech 2022, the first two authors contributed equally
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Although deep neural networks (DNNs) have achieved tremendous success in audio classification tasks, their uncertainty calibration is still under-explored. A well-calibrated model should be accurate when it is certain about its prediction and indicate high uncertainty when it is likely to be inaccurate. In this work, we investigate the uncertainty calibration for deep audio classifiers. In particular, we empirically study the performance of popular calibration methods: (i) Monte Carlo Dropout, (ii) ensemble, (iii) focal loss, and (iv) spectral-normalized Gaussian process (SNGP), on audio classification datasets. To this end, we evaluate (i-iv) for the tasks of environment sound and music genre classification. Results indicate that uncalibrated deep audio classifiers may be over-confident, and SNGP performs the best and is very efficient on the two datasets of this paper.
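
One common way to quantify the over-confidence described above is the expected calibration error (ECE), which compares average confidence with accuracy inside confidence bins. The abstract does not say which metric the authors use, so the sketch below is an assumption for illustration; the function name, the equal-width binning and the sample inputs are all hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE for top-label predictions.

    confidences: predicted probabilities of the chosen class, each in [0, 1].
    correct:     booleans, True where the prediction matched the label.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    # Bin i covers [i / n_bins, (i + 1) / n_bins); 1.0 falls in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical over-confident classifier: 90% confident, 75% accurate.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
```

A perfectly calibrated model yields an ECE of zero; the value grows as stated confidence drifts away from observed accuracy.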

[92]  arXiv:2206.13085 (cross-list from cs.SD) [pdf, other]
Title: Sound Model Factory: An Integrated System Architecture for Generative Audio Modelling
Journal-ref: International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar) (pp. 308-322). Springer, Cham. 2022
Subjects: Sound (cs.SD); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)

We introduce a new system for data-driven audio sound model design built around two different neural network architectures, a Generative Adversarial Network (GAN) and a Recurrent Neural Network (RNN), that takes advantage of the unique characteristics of each to achieve the system objectives that neither is capable of addressing alone. The objective of the system is to generate interactively controllable sound models given (a) a range of sounds the model should be able to synthesize, and (b) a specification of the parametric controls for navigating that space of sounds. The range of sounds is defined by a dataset provided by the designer, while the means of navigation is defined by a combination of data labels and the selection of a sub-manifold from the latent space learned by the GAN. Our proposed system takes advantage of the rich latent space of a GAN that consists of sounds that fill out the spaces "between" real data-like sounds. This augmented data from the GAN is then used to train an RNN for its ability to respond immediately and continuously to parameter changes and to generate audio over unlimited periods of time. Furthermore, we develop a self-organizing map technique for "smoothing" the latent space of GAN that results in perceptually smooth interpolation between audio timbres. We validate this process through user studies. The system contributes advances to the state of the art for generative sound model design that include system configuration and components for improving interpolation and the expansion of audio modeling capabilities beyond musical pitch and percussive instrument sounds into the more complex space of audio textures.

[93]  arXiv:2206.13101 (cross-list from cs.SD) [pdf, other]
Title: SpeechEQ: Speech Emotion Recognition based on Multi-scale Unified Datasets and Multitask Learning
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Speech emotion recognition (SER) has many challenges, but one of the main challenges is that each framework does not have a unified standard. In this paper, we propose SpeechEQ, a framework for unifying SER tasks based on a multi-scale unified metric. This metric can be trained by Multitask Learning (MTL), which includes two emotion recognition tasks of Emotion States Category (ESC) and Emotion Intensity Scale (EIS), and two auxiliary tasks of phoneme recognition and gender recognition. For this framework, we build a Mandarin SER dataset - SpeechEQ Dataset (SEQD). We conducted experiments on the public CASIA and ESD datasets in Mandarin, which show that our method outperforms baseline methods by a relatively large margin, yielding 8.0% and 6.5% improvement in accuracy respectively. Additional experiments on IEMOCAP with four emotion categories (i.e., angry, happy, sad, and neutral) also show that the proposed method achieves state-of-the-art results, with a weighted accuracy (WA) of 78.16% and an unweighted accuracy (UA) of 77.47%.
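
The two accuracy figures reported for IEMOCAP differ in how classes are weighted. Under the convention commonly used in the SER literature (assumed here, since the abstract does not define the terms), WA is the overall fraction of correct predictions and UA is the mean of the per-class recalls. The sketch below illustrates the difference; the function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def weighted_and_unweighted_accuracy(y_true, y_pred):
    """Return (WA, UA) for parallel sequences of true and predicted labels.

    WA: correct predictions / all samples (dominated by frequent classes).
    UA: per-class recall averaged over classes (each class counts equally).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    total = defaultdict(int)  # samples per true class
    hits = defaultdict(int)   # correct predictions per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            hits[truth] += 1

    wa = sum(hits.values()) / len(y_true)
    ua = sum(hits[label] / total[label] for label in total) / len(total)
    return wa, ua


# Toy, imbalanced example (not data from the paper): the rare class is
# always missed, which lowers UA much more than WA.
wa, ua = weighted_and_unweighted_accuracy(
    ["angry", "angry", "angry", "sad"],
    ["angry", "angry", "angry", "happy"],
)
```

On balanced test sets the two numbers coincide; a gap between them signals that performance differs across emotion classes.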

[94]  arXiv:2206.13110 (cross-list from cs.SD) [pdf, other]
Title: Sequence-level Speaker Change Detection with Difference-based Continuous Integrate-and-fire
Comments: Signal Processing Letters 2022
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Speaker change detection is an important task in multi-party interactions such as meetings and conversations. In this paper, we address the speaker change detection task from the perspective of sequence transduction. Specifically, we propose a novel encoder-decoder framework that directly converts the input feature sequence to the speaker identity sequence. The difference-based continuous integrate-and-fire mechanism is designed to support this framework. It detects speaker changes by integrating the speaker difference between the encoder outputs frame-by-frame and transfers encoder outputs to segment-level speaker embeddings according to the detected speaker changes. The whole framework is supervised by the speaker identity sequence, a weaker label than the precise speaker change points. The experiments on the AMI and DIHARD-I corpora show that our sequence-level method consistently outperforms a strong frame-level baseline that uses the precise speaker change labels.

[95]  arXiv:2206.13135 (cross-list from cs.CL) [pdf]
Title: TALCS: An Open-Source Mandarin-English Code-Switching Corpus and a Speech Recognition Baseline
Comments: accepted by INTERSPEECH 2022
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

This paper introduces a new corpus of Mandarin-English code-switching speech recognition--TALCS corpus, suitable for training and evaluating code-switching speech recognition systems. TALCS corpus is derived from real online one-to-one English teaching scenes in TAL education group, which contains roughly 587 hours of speech sampled at 16 kHz. To the best of our knowledge, TALCS corpus is the largest well labeled Mandarin-English code-switching open source automatic speech recognition (ASR) dataset in the world. In this paper, we will introduce the recording procedure in detail, including audio capturing devices and corpus environments. The TALCS corpus is freely available for download under a permissive license. Using TALCS corpus, we conduct ASR experiments in two popular speech recognition toolkits to make a baseline system, including ESPnet and Wenet. The Mixture Error Rate (MER) performance in the two speech recognition toolkits is compared in TALCS corpus. The experimental results imply that the quality of audio recordings and transcriptions is promising and the baseline system is workable.
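
For code-switching speech, a mixture error rate is typically computed like a word error rate, except that Mandarin is scored per character and English per word. The abstract does not spell out its exact scoring rules, so the sketch below is an assumption: the tokenizer, the lower-casing and all function names are hypothetical, and it is not the scoring script of ESPnet or Wenet.

```python
import re

# One CJK character, or one run of Latin letters/apostrophes, per token.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z']+")


def tokenize_mixed(text):
    """Split text into Mandarin characters and lower-cased English words."""
    return [tok.lower() for tok in _TOKEN.findall(text)]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mixture_error_rate(reference, hypothesis):
    """Edit distance over mixed tokens, normalised by reference length."""
    ref = tokenize_mixed(reference)
    hyp = tokenize_mixed(hypothesis)
    if not ref:
        raise ValueError("reference contains no scorable tokens")
    return edit_distance(ref, hyp) / len(ref)


# Made-up sentence pair: one English word is dropped out of five tokens.
mer = mixture_error_rate("我们 learn English 吧", "我们 learn 吧")
```

Scoring Mandarin at the character level avoids depending on a word segmenter, which is why a mixed unit is preferred over plain WER for this kind of data.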

[96]  arXiv:2206.13136 (cross-list from cs.SD) [pdf]
Title: A two-stage full-band speech enhancement model with effective spectral compression mapping
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

The direct expansion of deep neural network (DNN) based wide-band speech enhancement (SE) to full-band processing faces the challenge of low frequency resolution in the low frequency range, which is highly likely to degrade the performance of the model. In this paper, we propose a learnable spectral compression mapping (SCM) to effectively compress the high frequency components so that they can be processed in a more efficient manner. By doing so, the model can pay more attention to the low and middle frequency range, where most of the speech power is concentrated. Instead of suppressing noise in a single network structure, we first estimate a spectral magnitude mask, converting the speech to a high signal-to-noise ratio (SNR) state, and then utilize a subsequent model to further optimize the real and imaginary mask of the pre-enhanced signal. We conduct comprehensive experiments to validate the efficacy of the proposed method.

[97]  arXiv:2206.13307 (cross-list from cs.IT) [pdf, ps, other]
Title: Robust and Secure Resource Allocation for ISAC Systems: A Novel Optimization Framework for Variable-Length Snapshots
Comments: 30 pages, 11 figures
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

In this paper, we investigate the robust resource allocation design for secure communication in an integrated sensing and communication (ISAC) system. A multi-antenna dual-functional radar-communication (DFRC) base station (BS) serves multiple single-antenna legitimate users and senses for targets simultaneously, where already identified targets are treated as potential single-antenna eavesdroppers. The DFRC BS scans a sector with a sequence of dedicated beams, and the ISAC system takes a snapshot of the environment during the transmission of each beam. Based on the sensing information, the DFRC BS can acquire the channel state information (CSI) of the potential eavesdroppers. Different from existing works that focused on the resource allocation design for a single snapshot, in this paper, we propose a novel optimization framework that jointly optimizes the communication and sensing resources over a sequence of snapshots with adjustable durations. To this end, we jointly optimize the duration of each snapshot, the beamforming vector, and the covariance matrix of the artificial noise (AN) for maximization of the system sum secrecy rate over a sequence of snapshots while guaranteeing a minimum required average achievable rate and a maximum information leakage constraint for each legitimate user. The resource allocation algorithm design is formulated as a non-convex optimization problem, where we account for the imperfect CSI of both the legitimate users and the potential eavesdroppers. To make the problem tractable, we derive a bound for the uncertainty region of the potential eavesdroppers' small-scale fading based on a safe approximation, which facilitates the development of a block coordinate descent-based iterative algorithm for obtaining an efficient suboptimal solution.

[98]  arXiv:2206.13314 (cross-list from math.OC) [pdf, ps, other]
Title: Dimension-Free Matrix Spaces
Authors: Daizhan Cheng
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)

Based on various types of semi-tensor products of matrices, the corresponding equivalences of matrices are proposed. Then the corresponding vector space structures are obtained as the quotient spaces under equivalences, which are called the dimension-free matrix spaces (DFMSs). Certain structures and properties are investigated. Finally, the Lie bracket structure of general linear algebra is extended to DFMSs to make them Lie algebras, called dimension-free general linear algebras (DFGLAs). In spite of the fact that the DFGLAs are of infinite dimension, they have most properties of finite dimensional Lie algebras, which are studied in the paper.

[99]  arXiv:2206.13348 (cross-list from cs.RO) [pdf]
Title: A Novel Unified Self-alignment Method of SINS Based on FGO
Comments: 9 pages, Journal Papers
Subjects: Robotics (cs.RO); Signal Processing (eess.SP)

The self-alignment process can provide an accurate initial attitude of SINS. The conventional two-procedure method usually includes coarse and fine alignment processes. Coarse alignment is usually based on the OBA (optimization-based alignment) method, which batch-estimates the constant initial attitude at the beginning of self-alignment. OBA converges rapidly; however, the accuracy is low because the method does not consider the IMU's bias errors. The fine alignment applies a recursive Bayesian filter which makes the system error estimation of the IMU more accurate, but at the same time, the attitude error converges slowly with a large heading misalignment angle. Researchers have proposed the unified self-alignment to achieve self-alignment in one procedure, but when the misalignment angle is large, the existing methods based on a recursive Bayesian filter are still slow to converge. In this paper, a unified method based on the batch estimator FGO (factor graph optimization) is proposed. To the best of the authors' knowledge, this is the first batch method capable of estimating all the systematic errors of the IMU and the constant initial attitude simultaneously, with fast convergence and high accuracy. The effectiveness of this method is verified by simulation and physical experiments on a rotation SINS.

[100]  arXiv:2206.13356 (cross-list from cs.CV) [pdf, other]
Title: iExam: A Novel Online Exam Monitoring and Analysis System Based on Face Detection and Recognition
Comments: This is a technical report from the Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Online exams via video conference software like Zoom have been adopted in many schools due to COVID-19. While this is convenient, it is challenging for teachers to supervise online exams from simultaneously displayed student Zoom windows. In this paper, we propose iExam, an intelligent online exam monitoring and analysis system that can not only use face detection to assist invigilators in real-time student identification, but also detect common abnormal behaviors (including face disappearing, rotating faces, and replacing with a different person during the exams) via a face recognition-based post-exam video analysis. To build such a novel system, the first of its kind, we overcome three challenges. First, we discover a lightweight approach to capturing exam video streams and analyzing them in real time. Second, we utilize the left-corner names that are displayed on each student's Zoom window and propose an improved OCR (optical character recognition) technique to automatically gather the ground truth for the student faces with dynamic positions. Third, we perform several experimental comparisons and optimizations to efficiently shorten the training and testing time required on teachers' PCs. Our evaluation shows that iExam achieves high accuracy, 90.4% for real-time face detection and 98.4% for post-exam face recognition, while maintaining acceptable runtime performance. We have made iExam's source code available at https://github.com/VPRLab/iExam.

[101]  arXiv:2206.13370 (cross-list from cs.IT) [pdf, other]
Title: Adaptive Decoding Mechanisms for UAV-enabled Double-Uplink Coordinated NOMA
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

In this paper, we propose a novel adaptive decoding mechanism (ADM) for the unmanned aerial vehicle (UAV)-enabled uplink (UL) non-orthogonal multiple access (NOMA) communications. Specifically, considering a harsh UAV environment where ground-to-ground links are regularly unavailable, the proposed ADM overcomes the challenging problem of conventional UL-NOMA systems whose performance is sensitive to the transmitter's statistical channel state information and the receiver's decoding order. To evaluate the performance of the ADM, we derive closed-form expressions for the system outage probability (OP) and throughput. In the performance analysis, we provide novel expressions for practical air-to-ground and ground-to-air channels while taking into account the practical implementation of imperfect successive interference cancellation (SIC) in UL-NOMA. These results have not been previously reported in the technical literature. Moreover, the obtained expression can be adopted to characterize the OP of various systems under a Mixture of Gamma (MG) distribution-based fading channels. Next, we propose a sub-optimal Gradient Descent-based algorithm to obtain the power allocation coefficients that result in maximum throughput with respect to each location on UAV's trajectory, which follows a random waypoint mobility model for UAVs. Numerical solutions show that the ADM significantly improves the performance of UAV-enabled UL-NOMA, particularly in mobile environments.

[102]  arXiv:2206.13382 (cross-list from cs.IT) [pdf, ps, other]
Title: Multicarrier Modulation on Delay-Doppler Plane: Achieving Orthogonality with Fine Resolutions
Authors: Hai Lin, Jinhong Yuan
Comments: This paper was presented at the IEEE ICC 2022
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

In this paper, we investigate the design of a novel multicarrier (MC) modulation on delay-Doppler (DD) plane, to couple the modulated signal with a doubly-selective channel having DD resolutions. A key challenge for the design of DD plane MC modulation is to find a realizable pulse orthogonal with respect to the DD plane's fine resolutions. To this end, we first indicate that a feasible DD plane MC modulation is essentially a type of staggered multitone modulation. Then, we propose an orthogonal delay-Doppler division multiplexing (ODDM) modulation, and design the corresponding transmit pulse. Most importantly, we prove that the proposed transmit pulse is orthogonal with respect to the DD plane's resolutions and therefore a realizable DD plane orthogonal pulse does exist. Finally, we demonstrate the superior performance of the proposed ODDM modulation in terms of out-of-band radiation and bit error rate.

[103]  arXiv:2206.13388 (cross-list from cs.CV) [pdf]
Title: Transform-Invariant Convolutional Neural Networks for Image Classification and Search
Authors: David Yevick
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

This paper demonstrates that a simple modification of the variational autoencoder (VAE) formalism enables the method to identify and classify rotated and distorted digits. In particular, the conventional objective (cost) function employed during the training process of a VAE both quantifies the agreement between the input and output data records and ensures that the latent space representation of the input data record is statistically generated with an appropriate mean and standard deviation. After training, simulated data realizations are generated by decoding appropriate latent space points. However, standard VAEs trained on randomly rotated MNIST digits cannot reliably distinguish between different digit classes, since the rotated input data is effectively compared to a similarly rotated output data record. In contrast, an alternative implementation in which the objective function compares the output associated with each rotated digit to a corresponding fixed unreferenced reference digit is shown here to discriminate accurately among the rotated digits in latent space even when the dimension of the latent space is 2 or 3.

[104]  arXiv:2206.13390 (cross-list from cs.CV) [pdf, other]
Title: A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Video saliency detection (VSD) aims at quickly locating the most attractive objects/things/patterns in a given video clip. Existing VSD-related works have mainly relied on the visual system but paid less attention to the audio aspect, while, actually, our audio system is the most vital complementary part to our visual system. Also, audio-visual saliency detection (AVSD), one of the most representative research topics for mimicking human perceptual mechanisms, is currently in its infancy, and none of the existing survey papers have touched on it, especially from the perspective of saliency detection. Thus, the ultimate goal of this paper is to provide an extensive review to bridge the gap between audio-visual fusion and saliency detection. In addition, as another highlight of this review, we have provided a deep insight into key factors which could directly determine the performances of AVSD deep models, and we claim that the audio-visual consistency degree (AVC) -- a long-overlooked issue -- can directly influence the effectiveness of using audio to benefit its visual counterpart when performing saliency detection. Moreover, in order to make the AVC issue more practical and valuable for future followers, we have newly equipped almost all existing publicly available AVSD datasets with additional frame-wise AVC labels. Based on these upgraded datasets, we have conducted extensive quantitative evaluations to ground our claim on the importance of AVC in the AVSD task. In a word, both our ideas and new sets serve as a convenient platform with preliminaries and guidelines, all of which have great potential to facilitate future works in promoting state-of-the-art (SOTA) performance further.

[105]  arXiv:2206.13415 (cross-list from cs.CL) [pdf, ps, other]
Title: Is the Language Familiarity Effect gradual? A computational modelling approach
Comments: 8 pages, 2 figures, accepted at CogSci 2022
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

According to the Language Familiarity Effect (LFE), people are better at discriminating between speakers of their native language. Although this cognitive effect has been widely studied in the literature, experiments have only been conducted on a limited number of language pairs and their results only show the presence of the effect without yielding a gradual measure that may vary across language pairs. In this work, we show that the computational model of LFE introduced by Thorburn, Feldman and Schatz (2019) can address these two limitations. In a first experiment, we attest to this model's capacity to obtain a gradual measure of the LFE by replicating behavioural findings on native and accented speech. In a second experiment, we evaluate LFE on a large number of language pairs, including many which have never been tested on humans. We show that the effect is replicated across a wide array of languages, providing further evidence of its universality. Building on the gradual measure of LFE, we also show that languages belonging to the same family yield smaller scores, supporting the idea of an effect of language distance on LFE.

[106]  arXiv:2206.13418 (cross-list from cs.IT) [pdf, other]
Title: Belief-selective Propagation Detection for MIMO Systems
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

Compared to the linear MIMO detectors, the Belief Propagation (BP) detector has shown greater capabilities in achieving near optimal performance and better nature to iteratively cooperate with channel decoders. Aiming at real applications, recent works mainly fall into the category of reducing the complexity by simplified calculations, at the expense of performance sacrifice. However, the complexity is still unsatisfactory with exponentially increasing complexity or required exponentiation operations. Furthermore, due to the inherent loopy structure, the existing BP detectors persistently encounter error floor in high signal-to-noise ratio (SNR) region, which becomes even worse with calculation approximation. This work aims at a revised BP detector, named Belief-selective Propagation (BsP) detector, by selectively utilizing the trusted incoming messages with sufficiently large a priori probabilities for updates. Two proposed strategies: symbol-based truncation (ST) and edge-based simplification (ES) squeeze the complexity (orders lower than the Original-BP), while greatly relieving the error floor issue over a wide range of antenna and modulation combinations. For the $16$-QAM $8 \times 4$ MIMO system, the $\mathcal{B}(1,1)$ BsP detector achieves more than $4$ dB performance gain (at $\text{BER}=10^{-4}$) with roughly $4$ orders lower complexity than the Original-BP detector. Trade-off between performance and complexity towards different application requirement can be conveniently obtained by configuring the ST and ES parameters.

[107]  arXiv:2206.13441 (cross-list from cs.AI) [pdf, other]
Title: EMVLight: a Multi-agent Reinforcement Learning Framework for an Emergency Vehicle Decentralized Routing and Traffic Signal Control System
Comments: 19 figures, 10 tables. arXiv admin note: substantial text overlap with arXiv:2109.05429, arXiv:2111.00278
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Emergency vehicles (EMVs) play a crucial role in responding to time-critical calls such as medical emergencies and fire outbreaks in urban areas. Existing methods for EMV dispatch typically optimize routes based on historical traffic-flow data and design traffic signal pre-emption accordingly; however, we still lack a systematic methodology to address the coupling between EMV routing and traffic signal control. In this paper, we propose EMVLight, a decentralized reinforcement learning (RL) framework for joint dynamic EMV routing and traffic signal pre-emption. We adopt the multi-agent advantage actor-critic method with policy sharing and a spatial discount factor. This framework addresses the coupling between EMV navigation and traffic signal control via an innovative design of multi-class RL agents and a novel pressure-based reward function. The proposed methodology enables EMVLight to learn network-level cooperative traffic signal phasing strategies that not only reduce EMV travel time but also shorten the travel time of non-EMVs. Simulation-based experiments indicate that EMVLight enables up to a $42.6\%$ reduction in EMV travel time as well as a $23.5\%$ shorter average travel time compared with existing approaches.

[108]  arXiv:2206.13476 (cross-list from cs.SD) [pdf, other]
Title: Impact of Acoustic Event Tagging on Scene Classification in a Multi-Task Learning Framework
Comments: Accepted at ISCA Interspeech 2022
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)

Acoustic events are sounds with well-defined spectro-temporal characteristics which can be associated with the physical objects generating them. Acoustic scenes are collections of such acoustic events in no specific temporal order. Given this natural linkage between events and scenes, a common belief is that the ability to classify events must help in the classification of scenes. This has led to several efforts attempting to do well on Acoustic Event Tagging (AET) and Acoustic Scene Classification (ASC) using a multi-task network. However, in these efforts, improvement in one task does not guarantee an improvement in the other, suggesting a tension between ASC and AET. It is unclear if improvements in AET translate to improvements in ASC. We explore this conundrum through an extensive empirical study and show that under certain conditions, using AET as an auxiliary task in the multi-task network consistently improves ASC performance. Additionally, ASC performance further improves with the AET data-set size and is not sensitive to the choice of events or the number of events in the AET data-set. We conclude that this improvement in ASC performance comes from the regularization effect of using AET and not from the network's improved ability to discern between acoustic events.

Replacements for Tue, 28 Jun 22

[109]  arXiv:2002.02939 (replaced) [pdf]
Title: Phase Retrieval for Partially Coherent Observations
Comments: 12 pages, 14 figures
Journal-ref: IEEE Transactions on Signal Processing, vol. 69, pp. 1394-1406, 2021
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Numerical Analysis (math.NA); Optimization and Control (math.OC); Optics (physics.optics)
[110]  arXiv:2005.11149 (replaced) [pdf, other]
Title: On compression rate of quantum autoencoders: Control design, numerical and experimental realization
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Systems and Control (eess.SY)
[111]  arXiv:2011.14582 (replaced) [pdf, ps, other]
Title: Polar-Cap Codebook Design for MISO Rician Fading Channels with Limited Feedback
Comments: 5 pages, 4 figures, and published in IEEE Wireless Communications Letters
Journal-ref: IEEE Wireless Communications Letters, Volume: 10, Issue: 4, April 2021
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
[112]  arXiv:2101.12086 (replaced) [pdf, other]
Title: Risk-sensitive safety analysis using Conditional Value-at-Risk
Journal-ref: IEEE Transactions on Automatic Control, 2022
Subjects: Systems and Control (eess.SY)
[113]  arXiv:2103.02136 (replaced) [pdf, other]
Title: Toward a Scalable Upper Bound for a CVaR-LQ Problem
Comments: This version of the article makes almost-everywhere notions explicit (Lemma 3, Theorem 2)
Journal-ref: IEEE Control Systems Letters, vol. 6, pp. 920-925, 2021
Subjects: Systems and Control (eess.SY)
[114]  arXiv:2103.11978 (replaced) [pdf, other]
Title: Meta-learning Based Beamforming Design for MISO Downlink
Comments: conference
Journal-ref: IEEE International Symposium on Information Theory (ISIT), 2021, pp. 2954-2959
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)
[115]  arXiv:2105.13647 (replaced) [pdf, ps, other]
Title: Hybrid Beamforming for Intelligent Reflecting Surface Aided Millimeter Wave MIMO Systems
Comments: 32 pages, 9 figures. This paper is accepted to IEEE Transactions on Wireless Communications for publication
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)
[116]  arXiv:2106.01100 (replaced) [pdf, other]
Title: Prediction of the Position of External Markers Using a Recurrent Neural Network Trained With Unbiased Online Recurrent Optimization for Safe Lung Cancer Radiotherapy
Comments: 24 pages, 16 figures, minor text improvements (English writing)
Journal-ref: Computer Methods and Programs in Biomedicine, Volume 222, 2022, p.106908
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
[117]  arXiv:2106.12068 (replaced) [pdf, other]
Title: The Rate of Convergence of Variation-Constrained Deep Neural Networks
Authors: Gen Li, Jie Ding
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
[118]  arXiv:2108.02230 (replaced) [pdf, other]
Title: Nonholonomic dynamics and control of road vehicles: moving toward automation
Authors: Wubing B. Qin (1), Yiming Zhang (1), Dénes Takács (2), Gábor Stépán (2), Gábor Orosz (1) ((1) University of Michigan, (2) Budapest University of Technology and Economics)
Comments: 42 pages, 25 figures, 5 tables, accepted for inclusion in a future issue in Nonlinear Dynamics, Springer
Subjects: Systems and Control (eess.SY)
[119]  arXiv:2109.14900 (replaced) [pdf, other]
Title: Impact of Channel Variation on One-Class Learning for Spoof Detection
Subjects: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[120]  arXiv:2110.03098 (replaced) [pdf, other]
Title: CTC Variations Through New WFST Topologies
Comments: Accepted to Interspeech 2022, 5 pages, 2 figures, 7 tables
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
[121]  arXiv:2110.03299 (replaced) [pdf, other]
Title: End-To-End Label Uncertainty Modeling for Speech-based Arousal Recognition Using Bayesian Neural Networks
Comments: ACCEPTED to INTERSPEECH 2022
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[122]  arXiv:2110.05354 (replaced) [pdf, ps, other]
Title: Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition
Comments: 5 pages, in Interspeech 2022
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[123]  arXiv:2110.09121 (replaced) [pdf, ps, other]
Title: KaraTuner: Towards end to end natural pitch correction for singing voice in karaoke
Comments: To be published in Proc. Interspeech 2022, Incheon, South Korea
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[124]  arXiv:2111.09982 (replaced) [pdf, other]
Title: Second-Order Mirror Descent: Convergence in Games Beyond Averaging and Discounting
Comments: 16 pages, 12 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Optimization and Control (math.OC); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
[125]  arXiv:2112.07935 (replaced) [pdf, other]
Title: RawNeXt: Speaker verification system for variable-duration utterances with deep layer aggregation and extended dynamic scaling policies
Comments: 5 pages, 2 figures, 4 tables, accepted to 2022 ICASSP as a conference paper
Subjects: Audio and Speech Processing (eess.AS)
[126]  arXiv:2112.11716 (replaced) [pdf, other]
Title: Comparing radiologists' gaze and saliency maps generated by interpretability methods for chest x-rays
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[127]  arXiv:2201.01389 (replaced) [pdf, other]
Title: Semantic Communications: Principles and Challenges
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
[128]  arXiv:2201.02053 (replaced) [pdf, other]
Title: Channel Estimation and Multipath Diversity Reception for RIS-Empowered Broadband Wireless Systems Based on Cyclic-Prefixed Single-Carrier Transmission
Comments: Submitted to an IEEE Journal
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
[129]  arXiv:2201.03713 (replaced) [pdf, other]
Title: CVSS Corpus and Massively Multilingual Speech-to-Speech Translation
Comments: LREC 2022
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[130]  arXiv:2201.04472 (replaced) [pdf, other]
Title: Numerical and Experimental Characterization of LoRa-Based Helmet-to-Unmanned Aerial Vehicle Links on Flat Lands: A Numerical-Statistical Approach to Link Modeling
Journal-ref: IEEE Antennas and Propagation Magazine, 2022
Subjects: Systems and Control (eess.SY)
[131]  arXiv:2201.05213 (replaced) [pdf, other]
Title: Parallel Neural Local Lossless Compression
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[132]  arXiv:2201.06052 (replaced) [pdf, other]
Title: Self-Supervision and Multi-Task Learning: Challenges in Fine-Grained COVID-19 Multi-Class Classification from Chest X-rays
Comments: Accepted to Conference on Medical Image Understanding and Analysis (MIUA) 2022
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[133]  arXiv:2201.06262 (replaced) [pdf, other]
Title: Optimisation of Structured Neural Controller Based on Continuous-Time Policy Gradient
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
[134]  arXiv:2202.01641 (replaced) [pdf, other]
Title: Coupled Splines for Sparse Curve Fitting
Subjects: Image and Video Processing (eess.IV)
[135]  arXiv:2202.03472 (replaced) [pdf, ps, other]
Title: New Bounds on the Size of Binary Codes with Large Minimum Distance
Comments: ISIT 2022 camera-ready version. 5 pages of content, 1 page of references
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
[136]  arXiv:2203.00131 (replaced) [pdf, other]
Title: A Data-scalable Transformer for Medical Image Segmentation: Architecture, Model Efficiency, and Benchmark
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[137]  arXiv:2203.01786 (replaced) [pdf, other]
Title: Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows
Comments: 22 pages, 11 figures, 3 tables
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[138]  arXiv:2203.03635 (replaced) [pdf, ps, other]
Title: Stepwise Feature Fusion: Local Guides Global
Comments: 10 pages, 5 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[139]  arXiv:2203.04496 (replaced) [pdf, other]
Title: Millimeter-Scale Ultra-Low-Power Imaging System for Intelligent Edge Monitoring
Comments: 7 pages, 8 figures, tinyML Research Symposium 2022; revised author list
Subjects: Signal Processing (eess.SP)
[140]  arXiv:2203.04866 (replaced) [pdf, other]
Title: Joint-optimization of Node placement and UAV's Trajectory for Self-sustaining Air-Ground IoT system
Subjects: Signal Processing (eess.SP)
[141]  arXiv:2203.05780 (replaced) [pdf, other]
Title: Acoustic To Articulatory Speech Inversion Using Multi-Resolution Spectro-Temporal Representations Of Speech Signals
Comments: Accepted at ISCA Interspeech 2022
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
[142]  arXiv:2203.08651 (replaced) [pdf, ps, other]
Title: Construction of time-varying ISS-Lyapunov Functions for Impulsive Systems
Subjects: Systems and Control (eess.SY)
[143]  arXiv:2203.09132 (replaced) [pdf, other]
Title: Feature-informed Latent Space Regularization for Music Source Separation
Subjects: Audio and Speech Processing (eess.AS)
[144]  arXiv:2203.10750 (replaced) [pdf, other]
Title: WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses
Comments: accepted at InterSpeech2022
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
[145]  arXiv:2203.12215 (replaced) [pdf, other]
Title: Physics-Driven Deep Learning for Computational Magnetic Resonance Imaging
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP); Medical Physics (physics.med-ph)
[146]  arXiv:2203.13628 (replaced) [pdf, other]
Title: DeLoRes: Decorrelating Latent Spaces for Low-Resource Audio Representation Learning
Comments: Accepted to AAAI 2022 workshop on Self-supervised Learning for Audio and Speech Processing
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[147]  arXiv:2203.15081 (replaced) [pdf, other]
Title: Word Discovery in Visually Grounded, Self-Supervised Speech Models
Comments: Interspeech 2022
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
[148]  arXiv:2203.15149 (replaced) [pdf, other]
Title: CMGAN: Conformer-based Metric GAN for Speech Enhancement
Comments: 5 pages, 1 figure, 2 tables, accepted at INTERSPEECH 2022
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[149]  arXiv:2203.15405 (replaced) [pdf, other]
Title: Automatic Detection of Speech Sound Disorder in Child Speech Using Posterior-based Speaker Representations
Comments: Accepted to Interspeech 2022
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[150]  arXiv:2203.15952 (replaced) [pdf, other]
Title: 4-bit Conformer with Native Quantization Aware Training for Speech Recognition
Comments: Accepted by INTERSPEECH 2022
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
[151]  arXiv:2203.16080 (replaced) [pdf, other]
Title: Asymmetric Proxy Loss for Multi-View Acoustic Word Embeddings
Comments: Accepted to Interspeech 2022
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[152]  arXiv:2203.17019 (replaced) [pdf, other]
Title: DeepFry: Identifying Vocal Fry Using Deep Neural Networks
Comments: Accepted to Interspeech 2022
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[153]  arXiv:2203.17152 (replaced) [pdf, other]
Title: Perceptual Contrast Stretching on Target Feature for Speech Enhancement
Comments: Accepted by Interspeech 2022
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[154]  arXiv:2204.00588 (replaced) [pdf, other]
Title: Prefix-Free Coding for LQG Control
Comments: Under submission to the IEEE Journal on Selected Areas in Information Theory (Modern Compression Issue). Added some corrections we noticed before the review came back
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP); Systems and Control (eess.SY); Optimization and Control (math.OC)
[155]  arXiv:2204.00890 (replaced) [pdf, other]
Title: From Simulated Mixtures to Simulated Conversations as Training Data for End-to-End Neural Diarization
Comments: Accepted at Interspeech 2022
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[156]  arXiv:2204.03173 (replaced) [pdf, other]
Title: Automated Sleep Staging via Parallel Frequency-Cut Attention
Comments: 10 pages, 9 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
[157]  arXiv:2204.03793 (replaced) [pdf, other]
Title: Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition
Comments: Accepted by INTERSPEECH 2022
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[158]  arXiv:2204.04016 (replaced) [pdf, other]
Title: Disentangled Latent Speech Representation for Automatic Pathological Intelligibility Assessment
Comments: Submitted and Accepted at INTERSPEECH2022
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Quantitative Methods (q-bio.QM)
[159]  arXiv:2204.05201 (replaced) [pdf]
Title: A Post-Processing Tool and Feasibility Study for Three-Dimensional Imaging with Electrical Impedance Tomography During Deep Brain Stimulation Surgery
Authors: Sebastien Martin
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
[160]  arXiv:2204.05278 (replaced) [pdf, other]
Title: Neglectable effect of brain MRI data preprocessing for tumor segmentation
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[161]  arXiv:2204.06164 (replaced) [pdf, other]
Title: A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes
Comments: Accepted by INTERSPEECH 2022
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[162]  arXiv:2204.10586 (replaced) [pdf, ps, other]
Title: Efficient Training of Neural Transducer for Speech Recognition
Comments: accepted at Interspeech 2022
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[163]  arXiv:2204.12765 (replaced) [pdf, other]
Title: Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?
Comments: Accepted by INTERSPEECH 2022
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[164]  arXiv:2205.00693 (replaced) [pdf, other]
Title: Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding
Comments: Accepted by INTERSPEECH 2022
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[165]  arXiv:2205.07450 (replaced) [pdf, other]
Title: PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker Verification
Comments: INTERSPEECH 2022
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[166]  arXiv:2205.10215 (replaced) [pdf, ps, other]
Title: Audio Declipping with (Weighted) Analysis Social Sparsity
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[167]  arXiv:2205.11748 (replaced) [pdf, other]
Title: Deep Learning-based automated classification of Chinese Speech Sound Disorders
Comments: 17 pages, 9 figures, MDPI Children journal
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[168]  arXiv:2205.11801 (replaced) [pdf, other]
Title: SepIt: Approaching a Single Channel Speech Separation Bound
Comments: Accepted to INTERSPEECH 2022
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Machine Learning (stat.ML)
[169]  arXiv:2206.00515 (replaced) [pdf, other]
Title: Landslide4Sense: Reference Benchmark Data and Deep Learning Models for Landslide Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[170]  arXiv:2206.01856 (replaced) [pdf, other]
Title: Poisson2Sparse: Self-Supervised Poisson Denoising From a Single Image
Comments: Accepted to MICCAI 2022
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[171]  arXiv:2206.04548 (replaced) [pdf, other]
Title: Classification of COVID-19 in Chest X-ray Images Using Fusion of Deep Features and LightGBM
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[172]  arXiv:2206.06043 (replaced) [pdf, other]
Title: Combining BMC and Fuzzing Techniques for Finding Software Vulnerabilities in Concurrent Programs
Subjects: Software Engineering (cs.SE); Systems and Control (eess.SY)
[173]  arXiv:2206.06192 (replaced) [pdf, ps, other]
Title: Toward Zero Oracle Word Error Rate on the Switchboard Benchmark
Comments: Submitted to Interspeech 2022
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[174]  arXiv:2206.06813 (replaced) [pdf, other]
Title: Learning towards Synchronous Network Memorizability and Generalizability for Continual Segmentation across Multiple Sites
Comments: Early accepted in MICCAI2022
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[175]  arXiv:2206.07811 (replaced) [pdf, other]
Title: Safety Guarantees for Neural Network Dynamic Systems via Stochastic Barrier Functions
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
[176]  arXiv:2206.08189 (replaced) [pdf, other]
Title: Censer: Curriculum Semi-supervised Learning for Speech Recognition Based on Self-supervised Pre-training
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[177]  arXiv:2206.08835 (replaced) [pdf, other]
Title: What can Speech and Language Tell us About the Working Alliance in Psychotherapy
Comments: Accepted at Interspeech 2022
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[178]  arXiv:2206.09381 (replaced) [pdf, other]
Title: Graph Neural Network Aided MU-MIMO Detectors
Comments: Source Code: this https URL
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
[179]  arXiv:2206.10120 (replaced) [pdf, other]
Title: DECAL: DEployable Clinical Active Learning
Comments: ICML 2022 Workshop on Adaptive Experimental Design and Active Learning in the Real World
Subjects: Image and Video Processing (eess.IV)
[180]  arXiv:2206.10397 (replaced) [pdf, other]
Title: Neural Moving Horizon Estimation for Robust Flight Control
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
[181]  arXiv:2206.11053 (replaced) [pdf, other]
Title: Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer
Comments: Code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
[182]  arXiv:2206.11485 (replaced) [pdf, other]
Title: Patient Aware Active Learning for Fine-Grained OCT Classification
Comments: IEEE International Conference on Image Processing (ICIP)
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
[183]  arXiv:2206.11501 (replaced) [pdf, other]
Title: A novel adversarial learning strategy for medical image classification
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
