Title: Reliable Linear, Sesquilinear and Bijective Operations On Integer Data Streams Via Numerical Entanglement

Abstract: A new technique is proposed for fault-tolerant linear, sesquilinear and bijective (LSB) operations on $M$ integer data streams ($M\geq3$), such as scaling, additions/subtractions, inner or outer vector products, permutations and convolutions. In the proposed method, the $M$ input integer data streams are linearly superimposed to form $M$ numerically entangled integer data streams that are stored in place of the original inputs. A series of LSB operations can then be performed directly on these entangled data streams. The results are extracted from the $M$ entangled output streams by additions and arithmetic shifts. Any soft errors affecting any single disentangled output stream are guaranteed to be detectable via a specific post-computation reliability check. In addition, when a separate processor core is used for each of the $M$ streams, the proposed approach can recover all outputs after any single fail-stop failure. Importantly, unlike algorithm-based fault tolerance (ABFT) methods, the number of operations required for the entanglement, extraction and validation of the results grows linearly with the number of inputs and does not depend on the complexity of the performed LSB operations. We have validated our proposal on an Intel processor (Haswell architecture with AVX2 support) via fast Fourier transforms, circular convolutions, and matrix multiplication operations. Our analysis and experiments reveal that the proposed approach incurs a reduction of between $0.03\%$ and $7\%$ in processing throughput for a wide variety of LSB operations. This overhead is 5 to 1000 times smaller than that of the equivalent ABFT method that uses a checksum stream. Thus, our proposal can be used in fault-generating processor hardware or safety-critical applications, where high reliability is required without the cost of ABFT or modular redundancy.
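
The idea of superimposing streams, computing on the superposition, and then extracting and checking the results can be illustrated with a small sketch. This is a simplified illustration and not necessarily the exact construction used in the paper: it assumes that each entangled stream is the cyclically preceding input shifted left by `L` bits plus the current input, and that every output magnitude stays below `2**(L-1)`. The shift width `L` and the function names (`entangle`, `split`, `inner`, `extract_and_check`, `recover`) are hypothetical and chosen only for this example.

```python
L = 16  # assumed shift width in bits; every output must satisfy |value| < 2**(L-1)

def entangle(c):
    """Superimpose M integer streams: e[m][n] = (c[m-1][n] << L) + c[m][n]."""
    M = len(c)
    return [[(c[(m - 1) % M][n] << L) + c[m][n] for n in range(len(c[m]))]
            for m in range(M)]

def split(e):
    """Split e = (hi << L) + lo, with lo a signed L-bit value, into (hi, lo)."""
    half = 1 << (L - 1)
    lo = ((e + half) & ((1 << L) - 1)) - half  # signed low L bits
    hi = (e - lo) >> L                         # exact arithmetic shift
    return hi, lo

def inner(stream, w):
    """Example linear operation: inner product with integer weights w."""
    return sum(x * y for x, y in zip(stream, w))

def extract_and_check(e_out):
    """Extract the M outputs and verify that both copies of each one agree."""
    M = len(e_out)
    parts = [split(x) for x in e_out]          # parts[m] = (d[m-1], d[m])
    d = [lo for _, lo in parts]
    ok = all(parts[(m + 1) % M][0] == d[m] for m in range(M))
    return d, ok

def recover(e_out, lost):
    """Recover all M outputs when entangled output `lost` is unavailable."""
    M = len(e_out)
    d = [None] * M
    for m in range(M):
        if m == lost:
            continue
        hi, lo = split(e_out[m])
        d[(m - 1) % M] = hi
        d[m] = lo
    return d

# Three input streams and an integer weight vector (illustrative data).
c = [[1, -2, 3], [4, 5, -6], [-7, 8, 9]]
w = [2, -1, 3]

# The linear operation runs directly on the entangled streams.
e_out = [inner(s, w) for s in entangle(c)]
results, ok = extract_and_check(e_out)

# Simulate a soft error (single bit flip) in one entangled output.
corrupted = list(e_out)
corrupted[1] ^= 1 << 3
_, ok_after_fault = extract_and_check(corrupted)

# Simulate a fail-stop failure of the core holding stream 2.
recovered = recover(e_out, lost=2)
```

Because the operation is linear, applying it to an entangled stream yields the entanglement of the two corresponding outputs, so each output appears in two different entangled results. In this sketch that duplication is what allows a mismatch to be detected and a lost stream to be reconstructed, and the entangle, extract and check steps each touch every value once regardless of which linear operation is run in between.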
Comments: to appear in IEEE Trans. on Signal Processing, 2016
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as: arXiv:1604.04740 [cs.DC]
  (or arXiv:1604.04740v1 [cs.DC] for this version)

Submission history

From: Yiannis Andreopoulos
[v1] Sat, 16 Apr 2016 12:30:12 GMT (873kb)
