Title: Rolex: Resilience-Oriented Language Extensions for Extreme-Scale Systems

Abstract: Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that are less reliable than those used today, and faults will become the norm rather than the exception. This will pose significant problems for system designers and programmers, who for half a century have enjoyed an execution model that assumes correct behavior by the underlying computing system. The mean time to failure (MTTF) of a system scales inversely with the number of components in it, so faults and the resulting system-level failures will become more frequent as systems scale up in the number of processor cores and memory modules. However, not every detected error need cause catastrophic failure: many HPC applications are inherently fault resilient. Application programmers have this knowledge but lack mechanisms to convey it to the system.
In this paper, we present new Resilience-Oriented Language Extensions (Rolex), which facilitate the incorporation of fault resilience as an intrinsic property of the application code. We describe the syntax and semantics of the language extensions as well as the implementation of the supporting compiler infrastructure and runtime system. Our experiments show that an approach that leverages the programmer's insight to reason about the context and significance of faults for the application outcome significantly improves the probability that an application runs to a successful conclusion.
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as: arXiv:1605.01994 [cs.DC]
  (or arXiv:1605.01994v2 [cs.DC] for this version)

Submission history

From: Saurabh Hukerikar
[v1] Fri, 6 May 2016 16:23:09 GMT (1549kb,D)
[v2] Mon, 23 May 2016 04:16:21 GMT (1549kb,D)
