Title: Flock: Accurate network fault localization at scale

Abstract: Inferring the root cause of failures among thousands of components in a data center network is challenging, especially for "gray" failures that are not reported directly by switches. Faults can be localized through end-to-end measurements, but past localization schemes are either too slow for large-scale networks or sacrifice accuracy. We describe Flock, a network fault localization algorithm and system that achieves both high accuracy and speed at datacenter scale. Flock uses a probabilistic graphical model (PGM) to achieve high accuracy, coupled with new techniques to dramatically accelerate inference in discrete-valued Bayesian PGMs. Large-scale simulations and experiments in a hardware testbed show Flock speeds up inference by >10000x compared to past PGM methods, and improves accuracy over the best previous datacenter fault localization approaches, reducing inference error by 1.19-11x on the same input telemetry, and by 1.2-55x after incorporating passive telemetry. We also prove Flock's inference is optimal in restricted settings.
Comments: To appear in ACM PACMNET, Vol 1, June 2023
Subjects: Networking and Internet Architecture (cs.NI)
Cite as: arXiv:2305.03348 [cs.NI]
  (or arXiv:2305.03348v1 [cs.NI] for this version)

Submission history

From: Vipul Harsh
[v1] Fri, 5 May 2023 08:02:19 GMT (1648kb,D)
