“Transient Fault Recovery Techniques for the VLSI Processor Arrays”
by Elias S. Manolakos
May 1989
The high throughput rates necessary for real-time signal and image processing applications can now be met cost-effectively by VLSI technology. VLSI processor arrays, which have a large number of regularly connected processing elements and use massively parallel processing, provide a suitable form of supercomputing power in this application domain. Since most of the inevitable run-time faults are temporary in nature, transient fault recovery becomes a crucial issue, especially for arrays that operate in a hostile environment and have to meet stringent time constraints.
A new distributed retry mechanism called Neighbor Assisted Recovery (abbreviated as NEAR) is proposed. In contrast to the traditional Self Retries (SR) mechanism, NEAR uses some knowledge of the data dependency structure of the algorithm under execution to make better use of the neighbors of a temporarily failed processor. As a result, the array can continue to provide useful service in a degraded fashion, even in the presence of multiple clustered processor errors. The optimal dynamic assistant assignment policy, which guarantees the minimum recovery time overhead for any error pattern in a linear array, has been constructed.
The optimal policy of NEAR can be efficiently implemented using localized decision-making and near-neighbor communications. Using the OCCAM concurrent programming language and a board of Transputer processors, a simulator of fault-tolerant linear arrays has been developed. After an error pattern is injected, the recovery time overheads imposed by NEAR and by the traditional SR scheme can be measured and compared. For a large variety of error patterns, an average recovery speedup factor close to the theoretical value of two is observed.
If extra time can [...] detection and location. Time Redundancy with Dependency Graphs Interleaving for fault tolerance (TRI-ft) is a methodology that allows the user to create different time redundancy schemes and evaluate their merits early in the design phase. Trade-offs such as the expected roll-back path length versus the permanent error location capability can be studied.
In more general terms, this dissertation is an attempt to unify the quest for fault tolerance with the mapping of regular algorithms to array architectures.