Conference paper

A Deep Technical Review of nZDC Fault Tolerance

Authors

Minli Liao[1];Sam Ainsworth[2];Lev Mukhanov[3];Timothy M. Jones[1]

Affiliations

[1]University of Cambridge, Cambridge, United Kingdom;[2]University of Edinburgh, Edinburgh, United Kingdom;[3]Queen Mary University of London, London, United Kingdom

Publication date

March 1 - 2, 2025

Pages

104 - 116

DOI

10.1145/3708493.3712688

Source

CC: Compiler Construction, ISBN: 9798400714078, 2025, pp. 104 - 116

Abstract

Faults within CPU circuits, which generate incorrect results and thus silent data corruption, have become endemic at scale. The only generic techniques to detect one-time or intermittent soft errors, such as particle strikes or voltage spikes, require redundant execution, where copies of each instruction in a program are executed twice and compared. The only software solution for this task that is open source and available for use today is nZDC, which aims to achieve “near-zero silent data corruption” through control- and data-flow redundancy. However, when we tried to apply this to large-scale workloads, we found it suffered a wide set of false positives, negatives, compiler bugs and run-time crashes, which meant it was impossible to benchmark against. This document details the wide set of fixes and workarounds we had to put in place to make nZDC work across full suites. We provide many new insights as to the edge cases that make such instruction duplication tricky under complex ISAs such as AArch64 and their similarly complex ABIs. Evaluation across SPECint 2006 and PARSEC with our extensions takes us from no workloads executing to all bar four, with 2× and 1.6× geomean overhead respectively relative to execution with no fault tolerance.
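
The redundant-execution idea the abstract describes (compute each result twice, then compare) can be illustrated with a minimal Python sketch. This is not nZDC's implementation, which operates on AArch64 instructions inside a compiler; the names `checked_add`, `fault_mask`, and `SilentDataCorruptionDetected` are hypothetical and exist only for this illustration.

```python
class SilentDataCorruptionDetected(Exception):
    """Raised when the primary and shadow results disagree."""


def checked_add(a, b, fault_mask=0):
    """Compute a + b twice and compare the copies before the result is used.

    fault_mask is a hypothetical knob that flips bits in the primary
    result to emulate a transient fault; it is 0 in normal operation.
    """
    primary = (a + b) ^ fault_mask  # primary copy, possibly corrupted
    shadow = a + b                  # redundant (shadow) copy
    if primary != shadow:           # comparison point, e.g. before a store
        raise SilentDataCorruptionDetected(f"{primary} != {shadow}")
    return primary
```

With `fault_mask=0` the call returns the sum as usual; a non-zero mask makes the two copies diverge, so the mismatch is reported instead of silently propagating. The doubled work in this sketch is the kind of cost that the overhead figures in the abstract measure.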

Minli Liao[1];Sam Ainsworth[2];Lev Mukhanov[3];Timothy M. Jones[1]. A Deep Technical Review of nZDC Fault Tolerance[C]//CC '25: Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction, Las Vegas NV USA, March 1 - 2, 2025, US: ACM, 2025: 104 - 116