[1]  arXiv:2303.12281 [pdf, other]
Title: Synthetic Health-related Longitudinal Data with Mixed-type Variables Generated using Diffusion Models
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

This paper presents a novel approach to simulating electronic health records (EHRs) using diffusion probabilistic models (DPMs). Specifically, we demonstrate the effectiveness of DPMs in synthesising longitudinal EHRs that capture mixed-type variables, including numeric, binary, and categorical variables. To our knowledge, this represents the first use of DPMs for this purpose. We compared our DPM-simulated datasets to previous state-of-the-art results based on generative adversarial networks (GANs) for two clinical applications: acute hypotension and antiretroviral therapy (ART) for human immunodeficiency virus (HIV). Given the lack of similar previous studies on DPMs, a core component of our work involves exploring the advantages and caveats of employing DPMs across a wide range of aspects. In addition to assessing the realism of the synthetic datasets, we also trained reinforcement learning (RL) agents on the synthetic data to evaluate their utility for supporting the development of downstream machine learning models. Finally, we estimated that our DPM-simulated datasets are secure and pose a low patient exposure risk for public access.

[2]  arXiv:2303.12311 [pdf, other]
Title: Frozen Language Model Helps ECG Zero-Shot Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)

The electrocardiogram (ECG) is one of the most commonly used non-invasive, convenient medical monitoring tools that assist in the clinical diagnosis of heart diseases. Recently, deep learning (DL) techniques, particularly self-supervised learning (SSL), have demonstrated great potential in ECG classification. SSL pre-training has achieved competitive performance with only a small amount of annotated data after fine-tuning. However, current SSL methods rely on the availability of annotated data and are unable to predict labels that do not exist in the fine-tuning datasets. To address this challenge, we propose Multimodal ECG-Text Self-supervised pre-training (METS), the first work to utilize auto-generated clinical reports to guide ECG SSL pre-training. We use a trainable ECG encoder and a frozen language model to embed paired ECGs and auto-generated clinical reports separately. The SSL objective aims to maximize the similarity between each ECG and its paired auto-generated report while minimizing the similarity between that ECG and other reports. In downstream classification tasks, METS achieves around 10% improvement in performance without using any annotated data via zero-shot classification, compared to other supervised and SSL baselines that rely on annotated data. Furthermore, METS achieves the highest recall and F1 scores on the MIT-BIH dataset, despite MIT-BIH containing different classes of ECG compared to the pre-training dataset. The extensive experiments have demonstrated the advantages of using ECG-Text multimodal self-supervised learning in terms of generalizability, effectiveness, and efficiency.
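
The contrastive objective and the zero-shot step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact loss, so an InfoNCE-style loss over cosine similarities (ECG-to-report direction only) is assumed, the encoders are replaced by hand-written toy embeddings, and the names `contrastive_loss`, `zero_shot_predict`, `temperature`, and the two class labels are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(ecg_embs, text_embs, temperature=0.1):
    """InfoNCE-style loss (assumed form): ecg_embs[i] and text_embs[i] are a
    matched pair; every other report in the batch acts as a negative."""
    n = len(ecg_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(ecg_embs[i], text_embs[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all reports in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the paired report.
        total += log_denominator - logits[i]
    return total / n

def zero_shot_predict(ecg_emb, label_embs):
    """Return the label whose text embedding is most similar to the ECG embedding."""
    return max(label_embs, key=lambda name: cosine(ecg_emb, label_embs[name]))

# Toy 2-D embeddings standing in for encoder outputs (illustrative values only).
ecg = [[1.0, 0.1], [0.1, 1.0]]
reports = [[0.9, 0.2], [0.2, 0.9]]

aligned = contrastive_loss(ecg, reports)         # correct pairing
shuffled = contrastive_loss(ecg, reports[::-1])  # deliberately mismatched pairing

# Hypothetical class prompts embedded by the (frozen) language model.
labels = {"normal": [1.0, 0.0], "arrhythmia": [0.0, 1.0]}
prediction = zero_shot_predict([0.2, 0.8], labels)
```

With these toy values the loss is lower for the correct pairing than for the mismatched one, which is the behaviour the pre-training objective rewards. Zero-shot classification then needs no annotated ECGs, only a text embedding per candidate class.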

[3]  arXiv:2303.12329 [pdf]
Title: Boosting interoperability: towards an increasingly reusable bioinformatics knowledge base
Subjects: Databases (cs.DB); Quantitative Methods (q-bio.QM)

Background: Enhancing interoperability of bioinformatics knowledge bases is a high-priority requirement to maximize data reusability, and thus increase their utility such as the return on investment for biomedical research. A knowledge base may provide useful information for life scientists and other knowledge bases, but it only acquires exchange value once the knowledge base is (re)used, and without interoperability the utility lies dormant. Results: In this article, we discuss several approaches to boost interoperability depending on the interoperable parts. The findings are driven by several real-world scenario examples that were mostly implemented by Bgee, a well-established gene expression database. Moreover, we discuss ten general main lessons learnt. These lessons can be applied in the context of any bioinformatics knowledge base to foster data reusability. Conclusions: This work provides pragmatic methods and transferable skills to promote reusability of bioinformatics knowledge bases by focusing on interoperability.

[4]  arXiv:2303.12431 [pdf, other]
Title: Evolutionary Dynamics of a Lattice Dimer: a Toy Model for Stability vs. Affinity Trade-offs in Proteins
Comments: 13 pages, 15 figures
Subjects: Statistical Mechanics (cond-mat.stat-mech); Biological Physics (physics.bio-ph); Quantitative Methods (q-bio.QM)

Understanding how a stressor applied to a biological system shapes its evolution is key to achieving targeted evolutionary control. Here we present a toy model of two interacting lattice proteins to quantify the response to the selective pressure defined by the binding energy. We generate sequence data of proteins and study how the sequence and structural properties of dimers are affected by the applied selective pressure, both during the evolutionary process and in the stationary regime. In particular, we show that internal contacts of native structures lose strength, while inter-structure contacts are strengthened due to the folding-binding competition. We discuss how dimerization is achieved through enhanced mutability on the interacting faces, and how the designability of each native structure changes upon introduction of the stressor.

[5]  arXiv:2303.12651 [pdf]
Title: Model Validation and Selection in Metabolic Flux Analysis and Flux Balance Analysis
Comments: 23 pages, 2 figures, 1 table
Subjects: Molecular Networks (q-bio.MN); Quantitative Methods (q-bio.QM)

13C-Metabolic Flux Analysis (13C-MFA) and Flux Balance Analysis (FBA) are widely used to investigate the operation of biochemical networks in both biological and biotechnological research. Both methods use metabolic reaction network models operating at steady state, so that reaction rates (fluxes) and the levels of metabolic intermediates are constrained to be invariant. They provide estimated (MFA) or predicted (FBA) values of the fluxes through the network in vivo, which cannot be measured directly. A number of approaches have been taken to test the reliability of estimates and predictions from constraint-based methods and to decide on and/or discriminate between alternative model architectures. Despite advances in other areas of the statistical evaluation of metabolic models, validation and model selection methods have been underappreciated and underexplored. We review the history and state-of-the-art in constraint-based metabolic model validation and model selection. Applications and limitations of the χ²-test of goodness-of-fit, the most widely used quantitative validation and selection approach in 13C-MFA, are discussed, and complementary and alternative forms of validation and selection are proposed. A combined model validation and selection framework for 13C-MFA incorporating metabolite pool size information that leverages new developments in the field is presented and advocated for. Finally, we discuss how the adoption of robust validation and selection procedures can enhance confidence in constraint-based modeling as a whole and ultimately facilitate more widespread use of FBA in biotechnology in particular.
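
The chi-square goodness-of-fit test mentioned in this abstract can be illustrated with a short sketch. It is not taken from the paper: it assumes the usual variance-weighted sum of squared residuals (SSR) between measured and model-simulated values, degrees of freedom equal to the number of measurements minus the number of free parameters, and, for simplicity, a one-sided test. The function names and all numbers are hypothetical.

```python
import math

def chi2_sf(x, dof):
    """Chi-square survival function P(X > x), computed from the series
    expansion of the regularized lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    p_lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - p_lower)

def goodness_of_fit(measured, simulated, std_devs, n_free_params, alpha=0.05):
    """Variance-weighted SSR test: the fit is not rejected when p >= alpha."""
    ssr = sum(((m - s) / sd) ** 2
              for m, s, sd in zip(measured, simulated, std_devs))
    dof = len(measured) - n_free_params
    p_value = chi2_sf(ssr, dof)
    return ssr, dof, p_value, p_value >= alpha

# Hypothetical labeling measurements, model predictions, and measurement errors.
measured = [0.30, 0.50, 0.20, 0.45, 0.55]
simulated = [0.31, 0.49, 0.21, 0.44, 0.56]
std_devs = [0.01] * 5

ssr, dof, p_value, accepted = goodness_of_fit(
    measured, simulated, std_devs, n_free_params=2)
```

Here every residual is one standard deviation, so the SSR is about 5 on 3 degrees of freedom and the fit is not rejected at the 5% level. The outcome depends directly on the assumed measurement errors: shrinking `std_devs` inflates the SSR and can turn the same fit into a rejection.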

Replacements for Thu, 23 Mar 23

[6]  arXiv:2208.01918 (replaced) [pdf, other]
Title: DeepProphet2 -- A Deep Learning Gene Recommendation Engine
Authors: Daniele Brambilla (1), Davide Maria Giacomini (1), Luca Muscarnera, Andrea Mazzoleni (1) ((1) TheProphetAI)
Subjects: Quantitative Methods (q-bio.QM); Information Retrieval (cs.IR); Machine Learning (cs.LG)

[7]  arXiv:2303.07876 (replaced) [pdf]
Title: pracpac: Practical R Packaging with Docker
Subjects: Quantitative Methods (q-bio.QM)
