Storage clusters that manage immutable big data are susceptible to transient and permanent failures at both the node and rack levels, and increasingly employ erasure codes to maintain data availability. However, most existing erasure codes are storage-inefficient, as they dedicate an entire rack of parity information to tolerating node failures in only part of a rack. They also cause the recovery of common single-node failures to consume a significant amount of oversubscribed cross-rack bandwidth.
To relieve both the storage and recovery burdens of erasure codes, this project studies a new family of codes called recovery-oriented STAIR (R-STAIR) codes, which not only provide storage-efficient fault tolerance for mixed node and rack failures, but also achieve rack-local recovery for common single-node failures. R-STAIR codes extend our previously proposed STAIR codes to a storage cluster setting. We demonstrate the usability of R-STAIR codes by implementing them in a practical Hadoop cluster.
This project is conducted by the Advanced Network and System Research Laboratory in the Department of Computer Science and Engineering at the Chinese University of Hong Kong (CUHK).
The work is supported by grants AoE/E-02/08 and ECS CUHK419212 from the University Grants Committee of Hong Kong.