Title: High-Quality Fault Resiliency in Fat Trees

Authors: John Gliksberg (LI-PaRAD, UCLM), Antoine Capra, Alexandre Louvet, Pedro Javier Garcia (UCLM), Devan Sohier (LI-PaRAD)
Abstract: Coupling regular topologies with optimised routing algorithms is key in pushing the performance of interconnection networks of supercomputers. In this paper we present Dmodc, a fast deterministic routing algorithm for Parallel Generalised Fat-Trees (PGFTs) which minimises congestion risk even under massive network degradation caused by equipment failure. Dmodc computes forwarding tables with a closed-form arithmetic formula by relying on a fast preprocessing phase. This allows complete re-routing of networks with tens of thousands of nodes in less than a second. In turn, this greatly helps centralised fabric management react to faults with high-quality routing tables and no impact to running applications in current and future very large-scale HPC clusters.
Comments: arXiv admin note: text overlap with arXiv:2211.11817
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
Journal reference: IEEE Micro, 2020, 40 (1), pp. 44-49. ⟨10.1109/MM.2019.2949978⟩
Cite as: arXiv:2211.13101 [cs.DC]
  (or arXiv:2211.13101v1 [cs.DC] for this version)

Submission history

From: John Gliksberg
[v1] Wed, 23 Nov 2022 16:40:42 GMT (58kb,D)
