Title: Soft Error Resilience and Failure Recovery for Continuum Dynamics Applications

Abstract: Growing resilience concerns in today's large-scale computing systems call not only for generic fault tolerance approaches but also for application-level resilience, driven by demanding efficiency targets and domain-specific requirements. Scientific applications within a particular domain generally obey that domain's conservation laws, which can serve as an error detection criterion for studying the resilience of applications that share similar program characteristics. Achieving application resilience nevertheless raises two challenges: (a) how to identify the invariants of a given domain of applications from its conservation laws, and (b) how to use those invariants to detect and recover from failures efficiently during application runs.
In this work, we target the continuum dynamics software packages FleCSALE [1] and CODY [2], which maintain intrinsic invariants during computation. We study their resilience to soft errors online, injecting faults with an open-source fault injector, and investigate opportunities for non-intrusive, lightweight failure recovery through checksum-based invariant checking. We propose a checksum-retry approach to meet these goals; experimental results on a virtualized platform with extensive fault injection campaigns demonstrate its effectiveness and efficiency.
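The abstract names a checksum-retry approach but gives no implementation details, so the following is only a minimal sketch of the general idea, not the authors' code. It assumes a toy 1D periodic diffusion solver whose conserved quantity (the sum of the field) stands in for a domain invariant. The function names (`flip_bit`, `diffuse_step`, `run_with_checksum_retry`), the tolerance, and the retry limit are all hypothetical choices made for illustration.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float64 to emulate a soft error (illustrative only)."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def diffuse_step(u, alpha=0.1):
    """One explicit diffusion step on a periodic 1D grid.

    The update only moves quantity between neighbouring cells, so sum(u)
    is conserved up to floating-point rounding.
    """
    n = len(u)
    return [u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n])
            for i in range(n)]


def checksum(u):
    """Invariant used for detection: total of the conserved field."""
    return math.fsum(u)


def run_with_checksum_retry(u, steps, faulty_step=None,
                            max_retries=3, rel_tol=1e-9):
    """Advance `u`, re-executing any step whose checksum breaks the invariant.

    `faulty_step` (assumption) injects a single transient bit flip on the
    first attempt of that step, so a retry recomputes a clean result.
    Returns the final field and the number of retries performed.
    """
    reference = checksum(u)
    retries = 0
    for step in range(steps):
        for attempt in range(max_retries + 1):
            candidate = diffuse_step(u)
            if step == faulty_step and attempt == 0:
                candidate[0] = flip_bit(candidate[0], 62)  # injected fault
            if math.isclose(checksum(candidate), reference,
                            rel_tol=rel_tol, abs_tol=1e-12):
                u = candidate  # invariant holds: accept the step
                break
            retries += 1  # invariant violated: discard and recompute
        else:
            raise RuntimeError(f"step {step} failed after {max_retries} retries")
    return u, retries


if __name__ == "__main__":
    initial = [0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0]
    clean, _ = run_with_checksum_retry(list(initial), steps=10)
    recovered, n_retries = run_with_checksum_retry(list(initial), steps=10,
                                                   faulty_step=2)
    print("retries:", n_retries)
    print("matches fault-free run:", recovered == clean)
```

Because the previous state is kept until the candidate passes the check, recovery here needs no separate checkpoint. A sum-based checksum can only catch errors that perturb the conserved total beyond the tolerance; errors that cancel or fall below it would go undetected in this sketch.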
Comments: 18 pages
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
MSC classes: 68M15, 68M20, 68N20
Cite as: arXiv:1911.02114 [cs.DC]
  (or arXiv:1911.02114v1 [cs.DC] for this version)

Submission history

From: Li Tan
[v1] Tue, 5 Nov 2019 22:37:42 GMT (649kb)