Distributed, Parallel, and Cluster Computing
New submissions
New submissions for Thu, 28 Mar 24
- [1] arXiv:2403.17940 [pdf, ps, other]
Title: Navigating the Docker Ecosystem: A Comprehensive Taxonomy and Survey
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
The cloud computing landscape is rapidly expanding and growing in complexity. Cloud computing has emerged as a widely adopted model for efficiently processing large volumes of data by harnessing clusters of commodity computers. This evolution enables the handling of massive data through on-demand services, relying on numerous microservices with diverse dependencies. Container technology ensures secure storage, allowing for large-scale data processing with high scalability and portability. Container technology, particularly as exemplified by Docker over the last decade, plays a pivotal role in this scenario. It empowers microservices to process data swiftly, enabling developers to scale these services dynamically in real time. This paper begins by establishing a comprehensive taxonomy for delineating container architecture. Focusing specifically on Docker containers, we scrutinize the existing container-related literature. Through this taxonomy and survey, we not only discern similarities and disparities in the architectural approaches of Docker container technology but also pinpoint areas necessitating further research.
- [2] arXiv:2403.18073 [pdf, other]
Title: Workflow Mini-Apps: Portable, Scalable, Tunable & Faithful Representations of Scientific Workflows
Authors: Ozgur Ozan Kilic, Tianle Wang, Matteo Turilli, Mikhail Titov, Andre Merzky, Line Pouchard, Shantenu Jha
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Workflows are critical for scientific discovery. However, the sophistication, heterogeneity, and scale of workflows make building, testing, and optimizing them increasingly challenging. Furthermore, their complexity and heterogeneity make performance reproducibility hard. In this paper, we propose workflow mini-apps as a tool to address the challenges in building and testing workflows while controlling the fidelity of representing real-world workflows. Workflow mini-apps are deployed and run on various HPC systems and architectures without workflow-specific constraints. We offer insight into their design and implementation, providing an analysis of their performance and reproducibility. Workflow mini-apps thus advance the science of workflows by providing simple, portable, and managed (fidelity) representations of otherwise complex and difficult-to-control real workflows.
- [3] arXiv:2403.18374 [pdf, other]
Title: Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL
Authors: Marius Meyer, Tobias Kenter, Lucian Petrica, Kenneth O'Brien, Michaela Blott, Christian Pessl
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR)
Most FPGA boards in the HPC domain are well-suited for parallel scaling because of the direct integration of versatile and high-throughput network ports. However, the utilization of their network capabilities is often challenging and error-prone because the whole network stack and communication patterns have to be implemented and managed on the FPGAs. Also, this approach conceptually involves a trade-off between the performance potential of improved communication and the impact of resource consumption for communication infrastructure, since the utilized resources on the FPGAs could otherwise be used for computations. In this work, we investigate this trade-off, first by using synthetic benchmarks to evaluate the different configuration options of the communication framework ACCL and their impact on communication latency and throughput. We then use our findings to implement a shallow water simulation whose scalability heavily depends on low-latency communication. With a suitable configuration of ACCL, good scaling behavior can be shown up to all 48 FPGAs installed in the system. Overall, the results show that the availability of inter-FPGA communication frameworks as well as the configurability of framework and network stack are crucial to achieve the best application performance with low-latency communication.
- [4] arXiv:2403.18509 [pdf, ps, other]
Title: Distributed Maximum Consensus over Noisy Links
Comments: 5 pages, 7 figures, submitted to EUSIPCO 2024 conference
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP)
We introduce a distributed algorithm, termed noise-robust distributed maximum consensus (RD-MC), for estimating the maximum value within a multi-agent network in the presence of noisy communication links. Our approach entails redefining the maximum consensus problem as a distributed optimization problem, allowing a solution using the alternating direction method of multipliers. Unlike existing algorithms that rely on multiple sets of noise-corrupted estimates, RD-MC employs a single set, enhancing both robustness and efficiency. To further mitigate the effects of link noise and improve robustness, we apply moving averaging to the local estimates. Through extensive simulations, we demonstrate that RD-MC is significantly more robust to communication link noise compared to existing maximum-consensus algorithms.
- [5] arXiv:2403.18545 [pdf, other]
Title: Optimal Resource Efficiency with Fairness in Heterogeneous GPU Clusters
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Ensuring the highest training throughput to maximize resource efficiency, while maintaining fairness among users, is critical for deep learning (DL) training in heterogeneous GPU clusters. However, current DL schedulers provide only limited fairness properties and suboptimal training throughput, impeding tenants from effectively leveraging heterogeneous resources. The underlying design challenge stems from inherent conflicts between efficiency and fairness properties.
In this paper, we introduce OEF, a new resource allocation framework specifically developed for achieving optimal resource efficiency and ensuring diverse fairness properties in heterogeneous GPU clusters. By integrating resource efficiency and fairness within a global optimization framework, OEF is capable of providing users with maximized overall efficiency, as well as various guarantees of fairness, in both cooperative and non-cooperative environments. We have implemented OEF in a cluster resource manager and conducted large-scale experiments, showing that OEF can improve the overall training throughput by up to 32% while improving fairness compared to state-of-the-art heterogeneity-aware schedulers.
- [6] arXiv:2403.18619 [pdf, other]
Title: Enhanced OpenMP Algorithm to Compute All-Pairs Shortest Path on x86 Architectures
Comments: Accepted for publication in Computer Science - CACIC 2023
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Graphs have become a key tool when modeling and solving problems in different areas. The Floyd-Warshall (FW) algorithm computes the shortest path between all pairs of vertices in a graph and is employed in areas like communication networking, traffic routing, bioinformatics, among others. However, FW is computationally and spatially expensive since it requires O(n^3) operations and O(n^2) memory space. As the graph gets larger, parallel computing becomes necessary to provide a solution in an acceptable time range. In this paper, we studied a FW code developed for Xeon Phi KNL processors and adapted it to run on any Intel x86 processors, losing the specificity of the former. To do so, we verified one by one the optimizations proposed by the original code, making adjustments to the base code where necessary, and analyzing its performance on two Intel servers under different test scenarios. In addition, a new optimization was proposed to increase the concurrency degree of the parallel algorithm, which was implemented using two different synchronization mechanisms. The experimental results show that all optimizations were beneficial on the two x86 platforms selected. Last, the new optimization proposal improved performance by up to 23%.
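For readers unfamiliar with the Floyd-Warshall algorithm named in the abstract above, the following is a minimal sequential sketch in Python. It is not the paper's OpenMP code and includes none of its optimizations; the function name `floyd_warshall` and the adjacency-matrix input format are assumptions made for illustration. The triple loop shows where the O(n^3) operations come from, and the dense distance matrix accounts for the O(n^2) memory.

```python
from math import inf


def floyd_warshall(weights):
    """Return the all-pairs shortest-path distance matrix.

    `weights` is an n x n adjacency matrix (hypothetical input format):
    weights[i][j] is the edge weight from i to j, `inf` if there is no
    edge, and 0 on the diagonal. Negative cycles are assumed absent.
    """
    n = len(weights)
    # Copy the input so the caller's matrix is left untouched: O(n^2) memory.
    dist = [row[:] for row in weights]
    # Allow vertex k as an intermediate hop, one vertex at a time: O(n^3) time.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


if __name__ == "__main__":
    graph = [
        [0, 3, inf, 7],
        [8, 0, 2, inf],
        [5, inf, 0, 1],
        [2, inf, inf, 0],
    ]
    for row in floyd_warshall(graph):
        print(row)
```

The outer `k` loop carries a dependency between iterations, so parallel versions generally distribute the work inside each `k` step (often over blocks of the matrix) rather than across `k`.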
- [7] arXiv:2403.18639 [pdf, other]
Title: Dependency Aware Incident Linking in Large Cloud Systems
Authors: Supriyo Ghosh, Karish Grover, Jimmy Wong, Chetan Bansal, Rakesh Namineni, Mohit Verma, Saravan Rajmohan
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Despite significant reliability efforts, large-scale cloud services inevitably experience production incidents that can significantly impact service availability and customer satisfaction. Worse, in many cases one incident can lead to multiple downstream failures due to cascading effects that create several related incidents across different dependent services. Often, on-call engineers (OCEs) examine these incidents in silos, which leads to a significant amount of manual toil and increases the overall time to mitigate incidents. Therefore, developing efficient incident linking models is of paramount importance for grouping related incidents into clusters so as to quickly resolve major outages and reduce on-call fatigue. Existing incident linking methods mostly leverage textual and contextual information of incidents (e.g., title, description, severity, impacted components), thus failing to leverage the inter-dependencies between services. In this paper, we propose the dependency-aware incident linking (DiLink) framework, which leverages both textual and service dependency graph information to improve the accuracy and coverage of incident links, not only within the same service but also across different services and workloads. Furthermore, we propose a novel method to align the embeddings of multi-modal (i.e., textual and graphical) data using Orthogonal Procrustes. Extensive experimental results on real-world incidents from 5 workloads of Microsoft demonstrate that our alignment method has an F1-score of 0.96 (14% gain over current state-of-the-art methods). We are also in the process of deploying this solution across 610 services from these 5 workloads to continuously support OCEs, improving incident management and reducing manual toil.
Cross-lists for Thu, 28 Mar 24
- [8] arXiv:2403.18300 (cross-list from cs.CR) [pdf, other]
Title: HotStuff-2 vs. HotStuff: The Difference and Advantage
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Byzantine consensus protocols are essential in blockchain technology. The widely recognized HotStuff protocol uses cryptographic measures for efficient view changes and reduced communication complexity. Recently, the main authors of HotStuff introduced an advanced iteration named HotStuff-2. This paper aims to compare the principles and analyze the effectiveness of both protocols, hoping to depict their key differences and assess the potential enhancements offered by HotStuff-2.
- [9] arXiv:2403.18326 (cross-list from cs.CR) [pdf, ps, other]
Title: Privacy-Preserving Distributed Nonnegative Matrix Factorization
Comments: 5 pages, 1 figure, submitted to EUSIPCO 2024 conference
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP)
Nonnegative matrix factorization (NMF) is an effective data representation tool with numerous applications in signal processing and machine learning. However, deploying NMF in a decentralized manner over ad-hoc networks introduces privacy concerns due to the conventional approach of sharing raw data among network agents. To address this, we propose a privacy-preserving algorithm for fully-distributed NMF that decomposes a distributed large data matrix into left and right matrix factors while safeguarding each agent's local data privacy. It facilitates collaborative estimation of the left matrix factor among agents and enables them to estimate their respective right factors without exposing raw data. To ensure data privacy, we secure information exchanges between neighboring agents utilizing the Paillier cryptosystem, a probabilistic asymmetric algorithm for public-key cryptography that allows computations on encrypted data without decryption. Simulation results conducted on synthetic and real-world datasets demonstrate the effectiveness of the proposed algorithm in achieving privacy-preserving distributed NMF over ad-hoc networks.
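To illustrate what "computations on encrypted data without decryption" means for the Paillier cryptosystem mentioned above, here is a toy Python sketch of its additive homomorphism: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts. This is the textbook scheme with generator g = n + 1, not the paper's NMF protocol. The function names and the tiny primes 293 and 433 are assumptions chosen for readability and offer no security.

```python
import math
import random


def keygen(p, q):
    """Build a toy Paillier key pair from two primes (assumed distinct)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; valid because g = n + 1
    return n, (lam, mu)


def encrypt(n, m, rng=random):
    """Encrypt integer m (0 <= m < n) under public modulus n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)  # fresh randomness makes encryption probabilistic
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    return (c1 * c2) % (n * n)


if __name__ == "__main__":
    n, priv = keygen(293, 433)  # toy primes, insecure by design
    c_sum = add_encrypted(n, encrypt(n, 1200), encrypt(n, 345))
    print(decrypt(n, priv, c_sum))  # sum of the two plaintexts, modulo n
```

In a distributed setting, a property like this lets an agent combine its neighbors' encrypted values without seeing them individually; how the paper uses it in the NMF updates is not described in the abstract.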
- [10] arXiv:2403.18641 (cross-list from math.NA) [pdf, other]
Title: Improving Efficiency of Parallel Across the Method Spectral Deferred Corrections
Comments: 24 pages
Subjects: Numerical Analysis (math.NA); Distributed, Parallel, and Cluster Computing (cs.DC)
Parallel-across-the-method time integration can provide small-scale parallelism when solving initial value problems. Spectral deferred corrections (SDC) with a diagonal sweeper, which is closely related to iterated Runge-Kutta methods proposed by Van der Houwen and Sommeijer, can use a number of threads equal to the number of quadrature nodes in the underlying collocation method. However, convergence speed, efficiency, and stability depend critically on the coefficients used. Previous approaches have used numerical optimization to find good parameters. Instead, we propose an ansatz that allows optimal parameters to be found analytically. We show that the resulting parallel SDC methods provide stability domains and convergence order very similar to those of well-established serial SDC variants. Using a model for computational cost that assumes 80% efficiency of an implementation of parallel SDC, we show that our variants are competitive with serial SDC, previously published parallel SDC coefficients, as well as Picard iteration, explicit RKM-4, and an implicit fourth-order diagonally implicit Runge-Kutta method.
- [11] arXiv:2403.18682 (cross-list from cs.DS) [pdf, other]
Title: JumpBackHash: Say Goodbye to the Modulo Operation to Distribute Keys Uniformly to Buckets
Authors: Otmar Ertl
Comments: 8 pages
Subjects: Data Structures and Algorithms (cs.DS); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
The distribution of keys to a given number of buckets is a fundamental task in distributed data processing and storage. A simple, fast, and therefore popular approach is to map the hash values of keys to buckets based on the remainder after dividing by the number of buckets. Unfortunately, these mappings are not stable when the number of buckets changes, which can lead to severe spikes in system resource utilization, such as network or database requests. Consistent hash algorithms can minimize remappings, but are either significantly slower than the modulo-based approach, require floating-point arithmetic, or are based on a family of hash functions rarely available in standard libraries. This paper introduces JumpBackHash, which uses only integer arithmetic and a standard pseudorandom generator. Due to its speed and simple implementation, it can safely replace the modulo-based approach to improve assignment and system stability. A production-ready Java implementation of JumpBackHash has been released as part of the Hash4j open source library.
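The instability of modulo-based bucket assignment described above can be measured directly. The Python sketch below counts how many keys change bucket when the bucket count grows from 10 to 11, first with `hash % buckets` and then with Jump consistent hash (Lamping and Veach), an example of the floating-point consistent hash algorithms the abstract alludes to. It does not implement JumpBackHash. The helper names (`hash64`, `moved_fraction`) and the key format are assumptions for illustration.

```python
import hashlib

MASK64 = (1 << 64) - 1


def hash64(key: str) -> int:
    """Deterministic 64-bit hash of a string key (first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def modulo_bucket(h: int, num_buckets: int) -> int:
    """Classic assignment: remainder after dividing by the bucket count."""
    return h % num_buckets


def jump_bucket(h: int, num_buckets: int) -> int:
    """Jump consistent hash (Lamping and Veach); uses floating-point division."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step drives the pseudorandom jumps.
        h = (h * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * (float(1 << 31) / float((h >> 33) + 1)))
    return b


def moved_fraction(bucket_fn, hashes, old_n, new_n):
    """Fraction of keys whose bucket changes when old_n buckets become new_n."""
    moved = sum(1 for h in hashes if bucket_fn(h, old_n) != bucket_fn(h, new_n))
    return moved / len(hashes)


if __name__ == "__main__":
    hashes = [hash64(f"key-{i}") for i in range(5000)]
    print("modulo:", moved_fraction(modulo_bucket, hashes, 10, 11))
    print("jump:  ", moved_fraction(jump_bucket, hashes, 10, 11))
```

With uniformly distributed hash values, the modulo mapping keeps a key in place only when `h % 10 == h % 11`, which happens for about 1 key in 11, so roughly 10/11 of the keys move. A consistent hash should move only about 1/11 of the keys, all of them into the new bucket.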
- [12] arXiv:2403.18766 (cross-list from cs.LG) [pdf, ps, other]
Title: Superior Parallel Big Data Clustering through Competitive Stochastic Sample Size Optimization in Big-means
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
This paper introduces a novel K-means clustering algorithm, an advancement on the conventional Big-means methodology. The proposed method efficiently integrates parallel processing, stochastic sampling, and competitive optimization to create a scalable variant designed for big data applications. It addresses scalability and computation time challenges typically faced with traditional techniques. The algorithm adjusts sample sizes dynamically for each worker during execution, optimizing performance. Data from these sample sizes are continually analyzed, facilitating the identification of the most efficient configuration. By incorporating a competitive element among workers using different sample sizes, efficiency within the Big-means algorithm is further stimulated. In essence, the algorithm balances computational time and clustering quality by employing a stochastic, competitive sampling strategy in a parallel computing setting.
Replacements for Thu, 28 Mar 24
- [13] arXiv:2309.11190 (replaced) [pdf, other]
Title: Space and Move-optimal Arbitrary Pattern Formation on Infinite Rectangular Grid by Oblivious Robot Swarm
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
- [14] arXiv:2312.13094 (replaced) [pdf, other]
Title: Automated MPI code generation for scalable finite-difference solvers
Authors: George Bisbas, Rhodri Nelson, Mathias Louboutin, Paul H.J. Kelly, Fabio Luporini, Gerard Gorman
Comments: 10 pages, 12 figures (18 pages with References and Appendix)
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Mathematical Software (cs.MS); Performance (cs.PF)
- [15] arXiv:2402.08950 (replaced) [pdf, other]
Title: Taking GPU Programming Models to Task for Performance Portability
Authors: Joshua H. Davis, Pranav Sivaraman, Joy Kitson, Konstantinos Parasyris, Harshitha Menon, Isaac Minn, Giorgis Georgakoudis, Abhinav Bhatele
Comments: 12 pages, 7 figures
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
- [16] arXiv:2403.15721 (replaced) [pdf, other]
Title: Design and Implementation of an Analysis Pipeline for Heterogeneous Data
Authors: Arup Kumar Sarker, Aymen Alsaadi, Niranda Perera, Mills Staylor, Gregor von Laszewski, Matteo Turilli, Ozgur Ozan Kilic, Mikhail Titov, Andre Merzky, Shantenu Jha, Geoffrey Fox
Comments: 14 pages, 16 figures, 2 tables
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
- [17] arXiv:2102.12920 (replaced) [pdf, ps, other]
Title: Emerging Trends in Federated Learning: From Model Fusion to Federated X Learning
Authors: Shaoxiong Ji, Yue Tan, Teemu Saravirta, Zhiqin Yang, Yixin Liu, Lauri Vasankari, Shirui Pan, Guodong Long, Anwar Walid
Comments: To appear in the International Journal of Machine Learning and Cybernetics
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
- [18] arXiv:2305.13525 (replaced) [pdf, other]
Title: A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
- [19] arXiv:2305.17079 (replaced) [pdf, other]
Title: Complete Multiparty Session Type Projection with Automata
Comments: 24 pages, 44 pages including appendix; CAV 2023
Subjects: Formal Languages and Automata Theory (cs.FL); Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL)
- [20] arXiv:2307.13352 (replaced) [pdf, other]
Title: High Dimensional Distributed Gradient Descent with Arbitrary Number of Byzantine Attackers
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
- [21] arXiv:2311.01483 (replaced) [pdf, other]
Title: FedSN: A Novel Federated Learning Framework over LEO Satellite Networks
Comments: 14 pages, 17 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
- [22] arXiv:2402.01739 (replaced) [pdf, other]
Title: OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
- [23] arXiv:2403.17878 (replaced) [pdf, other]
Title: Empowering Data Mesh with Federated Learning
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)