COARSE-GRAINED DYNAMICALLY RECONFIGURABLE ARCHITECTURE WITH FLEXIBLE RELIABILITY

Dawood ALNAJJAR†, Younghun KO†, Takashi IMAGAWA†, Hiroaki KONOURA†, Masayuki HIROMOTO‡, Yukio MITSUYAMA†, Masanori HASHIMOTO‡, Hiroyuki OCHI‡, and Takao ONOYE‡

†Dept. Information Systems Engineering, Osaka University, Japan & JST CREST
‡Dept. Communications and Computer Engineering, Kyoto University, Japan & JST CREST

ABSTRACT

This paper proposes a coarse-grained dynamically reconfigurable architecture that offers flexible reliability against soft errors and aging. A cluster is introduced as the basic element of the proposed architecture; each cluster can select among four operation modes with different levels of spatial redundancy and area efficiency. Evaluation of permanent error rates demonstrates that four different reliability levels can be achieved by the proposed architecture. We also evaluate the aging effect due to NBTI and show that periodically alternating active cells with resting ones greatly mitigates the aging process with negligible power overhead. The area of the additional circuits needed to attain immunity to soft errors and to configure reliability is 26.6% of the proposed reconfigurable device. Finally, a fault-tolerance evaluation of a Viterbi decoder mapped on the proposed architecture suggests that there is a considerable trade-off between reliability and area overhead.

1. INTRODUCTION

With aggressive process scaling, sustaining reliability has become a major concern in VLSI design. As devices are miniaturized, the critical charge, which is the minimum charge needed to cause a bit flip, becomes smaller, and functional correctness is threatened by soft errors. In addition, negative bias temperature instability (NBTI) has become more prominent, and designers have difficulty guaranteeing device lifetime because of the aging process. On the other hand, reliability requirements depend on the application and the operating environment; hence, a design scheme that can flexibly choose countermeasures against reliability degradation is in demand.

To attain immunity to soft errors, soft error tolerant design has been attempted, especially in aerospace applications. For this purpose, time redundancy, spatial redundancy, and error correction coding (ECC) have been widely studied and utilized to detect a soft error and avoid a failure[1, 2, 3]. Especially for an application implementation with spatial redundancy, a reconfigurable device is suitable since redundant hardware, e.g. triple modular redundancy (TMR), can be easily realized thanks to the regular array structure. Besides, voters and ECC circuits are essential elements in attaining immunity to soft errors. In case of using fine-grained reconfigurable devices such as FPGAs, voters or ECC circuits can be implemented by LUTs in any part of the device and as many as necessary. In contrast, coarse-grained reconfigurable devices suffer from inefficiency in implementing voters or ECC circuits, since conventional coarse-grained reconfigurable architectures have no reliability consideration and do not equip such functionalities in their basic elements. Therefore, a reliability enhancement mechanism is demanded for coarse-grained reconfigurable architectures.

In order to mitigate NBTI, the dependency of NBTI on design and operating parameters has been studied. PMOS transistors age when negative gate-to-source voltage (\(V_{gs}\)) is applied, which is referred to as stress [4], causing a shift in the threshold voltage (\(V_{th}\)), and subsequently, the circuit delay increases. Once the stress is removed, the \(V_{th}\) is partially recovered. Consequently, the ratio of stress and recovery phases and their frequency are found to be important factors to characterize NBTI[5]. Especially long-term stress, such as 10 seconds, heavily degrades performance. Therefore, avoiding such a stressful operation helps to extend device life time. In case of reconfigurable devices, the replacement of active functional units with resting spare units is promising. In addition, the replacement without downtime, which is referred to as “hot-swapping”, is desirable from the perspective of application execution.

Motivated by these tendencies, the present paper proposes a coarse-grained dynamically reconfigurable architecture with reliability enhancements. The functionality and interconnect architecture of the reconfigurable device are based on our previous architecture[6], which offers media processing capabilities such as multi-standard video decoding. This paper focuses on the additional mechanism to change reliability levels depending on applications and environments. For reliability-oriented applications, the reconfigurable architecture achieves a sufficient level of reliability at the cost of area and power overhead, while for cost-oriented applications it provides area/power-efficiency. In addition, it can be also noted that reliability requirements are different even among circuit modules within an application, e.g. control parts of an application may require higher reliability level than that of datapath parts. Circuits on timing critical paths with less delay margin are more sensitive to delay increase due to aging, and NBTI recovery by hot-swapping is demanded. We devise a scheme where the reliability level can be selected individually for each basic element of the reconfigurable architecture to reduce area and power overhead by avoiding excessive reliability as much as possible.

To utilize the flexible reliability scheme, it is necessary to find out the required immunity to soft errors for each basic element. For that purpose, we developed a fault-tolerance evaluation method. In this paper, we also measure the number of sensitive bits in the configuration memory as an index of vulnerability, and demonstrate that there is a considerable difference among basic elements.

2. FLEXIBLE RELIABILITY IN ARCHITECTURE DESIGN

In order to achieve appropriate reliability with area efficiency, each basic element of the reconfigurable device should be able to change its own reliability referring to the sensitivity to soft errors and to circuit aging. This section reviews reliability classifications, and defines operation modes of basic elements with different reliability levels. Throughout this paper, a single soft error in a memory element and a soft error in combinational logic will be referred to as a single event upset (SEU) and a single event transient (SET), respectively.

2.1. Reliability classification to soft errors

Reliability of the configuration memory is often considered more seriously than that of the computed data, since an SEU on the configuration memory damages the functionality until the configuration data is reloaded again, which we will be referring to as a permanent error throughout the paper. Focusing on soft errors, we suppose the following four conditions CS1-CS4 that we aim to satisfy with basic elements of the reconfigurable architecture in this study:

CS1: functionality must be correct, and computed data must be correct as well,
CS2: functionality must be correct, and errors in computed data can be detected, however, only some of them can be corrected,
CS3: functionality must be correct, and errors in computed data are not considered,
CS4: no consideration for error detection and recovery is necessary.

2.2. Reliability classification to aging

In a coarse-grained reconfigurable device, aging mitigation of the data processing elements, especially those of which may compose critical paths, is necessary to avoid failure and sustain device life time. On the other hand, the circuits related to configuration memory never form speed-limiting paths. Accordingly, we suppose aging-oriented conditions CA1-CA2 that should be considered in architecture design:

CA1: aging effect in an execution module must be relaxed,
CA2: no consideration for aging mitigation is necessary.

2.3. Definition of four reliability levels

Taking conditions mentioned in both sections 2.1 and 2.2 into consideration, in order to achieve different levels of reliability to soft errors and to circuit aging, we define four reliability levels RL1-RL4 as summarized in Table 1. In the conditions of CS1 and CS2, we assume that the circuit aging mitigation in data processing elements should be considered, however, in CS3 and CS4 the circuit aging mitigation is compromised. Effective architecture for these four reliability levels with small area overhead will be described in what follows.

3. PROPOSED ARCHITECTURE FOR FLEXIBLE RELIABILITY

In this section, we present the proposed reconfigurable architecture with flexible reliability for soft errors and circuit aging.

3.1. Reconfigurable architecture for soft error tolerance

Figure 1 illustrates the overview of the proposed architecture. Having designed the architecture independent from the granularity, we will represent the granularity using the variable $n$. Clusters, which are basic elements of our architecture, are placed repeatedly in a two-dimensional array. A cluster has a switch (CFGSM: configuration memory switching matrix) and four cells, each of which consists of an execution module (EM) (in case of ALU and multiplier cluster) or a register module (RM) (in case of register cluster), three configuration memories (CFGs) for dynamic configuration of the EM/RM, and voters (VCs), all in a reconfigurable cell unit (RCU). To realize flexible dependability, the proposed architecture also introduces a redundancy control unit (RDU) and a comparing and voting unit (CVU).

Inter-cluster interconnection has four tracks (Track0–3), and each cell inside a cluster is placed on one of them. Thus, each cell in a cluster can be connected to the cells in adjacent clusters on the same track. Inside a cluster, each cell can be connected to adjacent cells via a diagonal intra-cluster connection. This overall interconnection enables application mapping in all four operation modes, which are summarized in Table 2 and correspond to the four reliability levels RL1-RL4 in Table 1. TMR (RL1), double modular redundancy (DMR) (RL2), single modular with single context (SMS) (RL3), and single modular with multi-context (SMM) (RL4) are supported, which offer different reliability levels and different capabilities of dynamic reconfigurability (#contexts). In the case of TMR, DMR and SMS, as shown in Fig. 2, each cell has three redundant CFGs which contain one context, three VCs, a selector CS (a part of CFGSM) and the EM/RM. An SEU occurring in a CFG is repaired when the next clock is given to the CFGs, since the voted value is rewritten to the CFGs in every clock cycle. On the other hand, in SMM mode, the voters are disabled, and three contexts are stored in the CFGs of each cell.
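
The self-repairing behaviour of the redundant CFGs can be illustrated with a minimal Python sketch. It assumes a bitwise 2-of-3 majority voter and a hypothetical 8-bit context word; the names `majority` and `scrub` and the word width are illustrative and are not taken from the actual design.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three equal-width words."""
    return (a & b) | (b & c) | (a & c)


def scrub(cfgs: list[int]) -> list[int]:
    """Model one clock tick: vote over the three CFG copies and
    rewrite the voted value into all three of them."""
    voted = majority(*cfgs)
    return [voted, voted, voted]


# Hypothetical 8-bit context word held in three redundant CFGs.
golden = 0b1011_0010
cfgs = [golden, golden ^ 0b0000_1000, golden]  # an SEU flips bit 3 of one copy
cfgs = scrub(cfgs)                             # the next clock repairs the copy
```

Because the voted value is rewritten every cycle, a single upset cannot persist long enough to combine with a later upset in another copy, which is why an unrepaired (permanent) error needs two upsets within the same cycle in this simplified model.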

In TMR mode shown in Fig. 2, the outputs of three EMs/RMs pass through the three voters (VD), while the fourth cell is reserved as a spare cell for the hot-swapping operation described in Sect. 3.2. An SET or SEU occurring in VC, CS and EM will be recovered in the VDs. An SET occurring in the VDs will propagate to successive clusters. With the prohibition of data feedback inside a cluster and the enforcement of voting at every output of the cells, the proposed architecture can avoid error accumulation in EM/RM without introducing a rollback mechanism. In DMR mode, on the other hand, the outputs of the EMs along with the parity bits are directed to a comparator and selector (C&S). SEUs occurring in the registers of EM are detectable using parity bits, and can be recovered in C&S by selecting the correct output. However, SETs in VC, CS and EM can only be detected in the C&S. In the case of SMS and SMM, only SEUs in the registers of EM can be detected, while SETs in CS and EM will propagate to successive clusters.
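
The difference between correcting in TMR and detecting in DMR can be sketched as follows. This is a behavioural model under stated assumptions, not the actual C&S circuit: each EM output is modelled as a `(value, stored_parity)` pair, even parity is assumed, and the status strings are hypothetical.

```python
from typing import Optional, Tuple


def parity(x: int) -> int:
    """Even-parity bit of x (the parity convention is an assumption)."""
    return bin(x).count("1") & 1


def vote3(a: int, b: int, c: int) -> int:
    """TMR: a bitwise majority vote masks any single faulty output."""
    return (a & b) | (b & c) | (a & c)


def compare_and_select(out0: Tuple[int, int],
                       out1: Tuple[int, int]) -> Tuple[Optional[int], str]:
    """DMR: each output is (value, stored_parity).

    A register SEU breaks the parity of one output, so the other one can
    be selected. An SET yields two parity-consistent but different values,
    which can be detected but not corrected.
    """
    (v0, p0), (v1, p1) = out0, out1
    ok0, ok1 = parity(v0) == p0, parity(v1) == p1
    if ok0 and ok1:
        return (v0, "ok") if v0 == v1 else (None, "detected")
    if ok0:
        return v0, "corrected"
    if ok1:
        return v1, "corrected"
    return None, "detected"


good = 0x5A
seu_case = compare_and_select((good, parity(good)), (good ^ 0x01, parity(good)))
set_case = compare_and_select((good, parity(good)), (0x5B, parity(0x5B)))
```

In this model `seu_case` is recovered by selection, whereas `set_case` only raises a detection flag, mirroring the distinction drawn in the paragraph above.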

The RDU configures the operation mode of the cluster and the context selection stored in the CFGs. The RDU has a 6-bit configuration data; 2 bits for the operation mode selection, 2 bits for cell usage selection for TMR and DMR, and the rest for the context selection. This configuration data is stored with bitwise TMR, and hence, the RDU is SEU-tolerable. The dynamic context selection can be carried out just by changing the 2 bits in the RDU.
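
As an illustration of the 6-bit RDU word, the following Python sketch packs the three 2-bit fields and stores the word with bitwise TMR. The bit positions, the numeric mode encoding, and the class and function names are assumptions made for this example; the text above only fixes the field widths.

```python
# Assumed mode encoding; the text only states that the mode field is 2 bits.
MODES = ("TMR", "DMR", "SMS", "SMM")


def pack_rdu(mode: int, cell_usage: int, context: int) -> int:
    """Pack three 2-bit fields into a 6-bit word.

    Assumed layout: bits [5:4] mode, [3:2] cell usage, [1:0] context.
    """
    for field in (mode, cell_usage, context):
        if not 0 <= field <= 3:
            raise ValueError("each RDU field is 2 bits wide")
    return (mode << 4) | (cell_usage << 2) | context


def unpack_rdu(word: int) -> tuple[int, int, int]:
    """Return (mode, cell_usage, context) from a 6-bit word."""
    return (word >> 4) & 3, (word >> 2) & 3, word & 3


class TmrRegister:
    """Three copies of a word, read back through a bitwise majority vote."""

    def __init__(self, word: int) -> None:
        self.copies = [word, word, word]

    def read(self) -> int:
        a, b, c = self.copies
        return (a & b) | (b & c) | (a & c)


word = pack_rdu(MODES.index("SMM"), 0, 2)  # SMM mode, context 2
rdu = TmrRegister(word)
rdu.copies[1] ^= 0b100000                  # an SEU hits one stored copy
```

Switching context then amounts to rewriting only the two low-order bits in this layout, for example `pack_rdu(3, 0, 1)` instead of `pack_rdu(3, 0, 2)`.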

The CFG memory size is different for ALU, multiplier and register clusters and is dependent on $n$. The CFG memory configures the interconnect and the EM/RM of the cell, which will be explained further in this section. The CFGSM is responsible for selecting a context for each cell, and enabling dynamic reconfiguration and four operation modes.

3.2. Reconfigurable architecture for aging mitigation

In order to mitigate circuit aging, a hot-swapping operation, which replaces active functional units with resting spare units during runtime, is performed in both TMR and DMR modes. Figure 3 depicts the organization of four EMs with hot-swapping multiplexers labeled with the letter H. Over time, one of the EMs rests in turn while the others process the input data 0, 1 and 2; e.g., EM0, EM1, and EM2 are replaced with EM0, EM1, and EM3, respectively, after one hot-swapping cycle. In TMR mode, there are three redundant cells and one spare cell as explained in Sect. 3.1; the three redundant cells are in the stress phase, while the spare cell is in the recovery phase. A cell-swapping operation exchanges an active cell with the spare cell. In DMR mode, on the other hand, there are either a pair of redundant cells and two spare cells or two pairs of redundant cells. In the same manner as in TMR mode, a cell-swapping operation exchanges an active cell with a spare cell and thereby mitigates the aging process.
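
A possible rotation of the resting cell in TMR mode can be sketched in Python as below. Only the first swap (EM0, EM1, EM2 to EM0, EM1, EM3) follows the example in the text; the order of the later swaps and the function names are assumptions made for illustration.

```python
def tmr_schedule(period: int, cells: int = 4) -> tuple[list[int], int]:
    """Return (active cell indices, resting cell index) for a swap period.

    The resting cell is assumed to rotate as EM3, EM2, EM1, EM0, EM3, ...
    """
    resting = (cells - 1 - period) % cells
    active = [c for c in range(cells) if c != resting]
    return active, resting


def stress_fraction(cell: int, periods: int = 4) -> float:
    """Fraction of swap periods in which the given cell is active."""
    active_count = sum(cell in tmr_schedule(p)[0] for p in range(periods))
    return active_count / periods


for p in range(4):
    active, resting = tmr_schedule(p)
    print(f"period {p}: active EMs {active}, resting EM{resting}")
```

With this rotation every cell is active for three swap periods and then rests for one, so each cell is under stress for three quarters of the time.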

3.3. Functionality of reconfigurable cells

In the ALU cluster, each EM in the cell has an n-bit ALU, a shifter and a parity error detector (PED), as depicted in Fig. 4. According to the configuration memory, the EM can be configured to perform addition and subtraction with or without the cooperation of the neighboring cells. It can also be configured to perform logical operations such as AND and OR, multiplexing, and fixed or variable shifting. In SMS and SMM modes, each cell in a cluster can cooperate with the neighboring cells to perform a multi-byte operation. For example, four cells can work collaboratively to perform a 4n-bit operation.
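
The 4n-bit cooperation can be pictured as four n-bit adders linked by a carry. The following Python sketch assumes a simple ripple-carry link between neighboring cells, which the text does not specify; `alu_add` and `chained_add` are hypothetical names.

```python
def alu_add(a: int, b: int, carry_in: int, n: int = 8) -> tuple[int, int]:
    """One n-bit ALU cell: return (n-bit sum, carry out)."""
    total = a + b + carry_in
    return total & ((1 << n) - 1), total >> n


def chained_add(a: int, b: int, n: int = 8, cells: int = 4) -> tuple[int, int]:
    """Add two (cells * n)-bit operands by passing the carry from each
    cell to its neighbor, least significant slice first."""
    mask = (1 << n) - 1
    carry = 0
    result = 0
    for i in range(cells):
        slice_a = (a >> (i * n)) & mask
        slice_b = (b >> (i * n)) & mask
        s, carry = alu_add(slice_a, slice_b, carry, n)
        result |= s << (i * n)
    return result, carry


total, carry_out = chained_add(0x0000FFFF, 0x00000001)  # 32-bit add with n = 8
```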

The EM in the multiplier cluster contains a multiplier, a PED and a shifter. Similarly, according to the bit string stored in the CFG memory, the multiplier can be configured to perform \( n \times n \) bit signed/unsigned multiplication. However, unlike the ALU cluster, the maximum operand width is \( n \) bits in all operation modes. EMs for both clusters have two \( n \)-bit registers for input (A, B), four 1-bit registers, and a result register (Y), which is \( 2n \) bits wide for the multiplier cluster and \( n \) bits wide for the ALU cluster. The CFG memories also control the configuration of all registers. They can be configured to act as pipeline registers that store new data every cycle, to be disabled, or to hold a constant value.

On the other hand, RMs in register clusters contain a 16-word register file with a word size of \( n \) bits. The register file can work not only as a register file, but also as a delay unit, which outputs the input data after 1-16 cycles, or as a LUT. In the case of TMR, simple voting at the output of three cells is not an adequate measure, since the soft error will remain in the register file. Consequently, in TMR mode we write the voted value back after every data read to ensure that soft errors are eliminated. DMR mode, on the other hand, is not implemented in the register cluster.
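
The write-back after read can be modelled with a short Python sketch. It is a behavioural illustration that assumes three identical register-file copies and a bitwise majority vote; the class name and its interface are hypothetical.

```python
class TmrRegisterFile:
    """Three redundant 16-word register files with write-back on read."""

    WORDS = 16

    def __init__(self, n: int = 8) -> None:
        self.mask = (1 << n) - 1
        self.banks = [[0] * self.WORDS for _ in range(3)]

    def write(self, addr: int, value: int) -> None:
        for bank in self.banks:
            bank[addr] = value & self.mask

    def read(self, addr: int) -> int:
        a, b, c = (bank[addr] for bank in self.banks)
        voted = (a & b) | (b & c) | (a & c)
        # Write the voted value back so that the upset does not stay in
        # the file and combine with a later upset in another copy.
        self.write(addr, voted)
        return voted


rf = TmrRegisterFile()
rf.write(5, 0xA7)
rf.banks[2][5] ^= 0x10   # an SEU corrupts one copy
first = rf.read(5)       # voted value is returned and written back
rf.banks[0][5] ^= 0x01   # a later SEU in a different copy
second = rf.read(5)      # still correct because the first error was cleaned
```

Without the write-back, the two upsets in this example would leave two of the three copies wrong in different bits; voting would still succeed here, but upsets hitting the same bit position in two copies would not be recoverable.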

Fig. 3. Organization of hot-swapping multiplexers and hot-swapping operations.

Fig. 4. Execution module (EM) for Multiplier and ALU clusters (n=8).

4. ARCHITECTURE EVALUATION

This section evaluates the proposed architecture in terms of permanent soft error rates, circuit aging, and area overhead.

4.1. Soft error reliability

The permanent error rates in different reliability modes are analytically calculated and evaluated in this section.

4.1.1. Preparation

Let \( \lambda_U \) denote the SEU rate of a 1-bit memory element. As for SETs, we assume that the SET rate is proportional to the area of a combinational circuit. We thus calculate SET rates with \( \lambda_T \), which is the SET rate of a single gate. We here assume that \( \lambda_T \) only includes SETs that are captured in FFs, while SETs that are filtered out by electrical, logical and temporal masking are not included. SEU rates in memory blocks such as CFG, EM/RM, and RDU are expressed as \( \lambda_U \) multiplied by the number of memory bits. In contrast, SET rates in combinational logic or partial combinational logic such as VC, CS, EM/RM, and VD are calculated based on their area. We enumerated all cases in which a permanent error could occur, derived analytical expressions of the error rates that correspond to the enumerated cases, and then evaluated the error rates of each mode using the derived expressions.
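
The scaling rules above (SEU rate proportional to the number of memory bits, SET rate proportional to the gate count) can be written as a small Python sketch. It computes only raw upset rates of a block, not the permanent error rates derived in this paper, and every numeric value below is a hypothetical placeholder.

```python
FIT_TO_PER_HOUR = 1e-9  # 1 FIT corresponds to 1e-9 errors per hour


def seu_rate_fit(lambda_u_fit: float, memory_bits: int) -> float:
    """Raw SEU rate of a memory block: per-bit rate times number of bits."""
    return lambda_u_fit * memory_bits


def set_rate_fit(lambda_t_fit: float, gate_count: int) -> float:
    """Raw SET rate of combinational logic: per-gate rate times gate count,
    using the gate count as a stand-in for area."""
    return lambda_t_fit * gate_count


def fit_to_per_hour(rate_fit: float) -> float:
    """Convert a rate in FIT to errors per hour."""
    return rate_fit * FIT_TO_PER_HOUR


# Hypothetical block: 1,000 memory bits and 500 gates.
lambda_u = 2.0                      # per-bit SEU rate in FIT (placeholder)
for ratio in (0.01, 0.1, 1.0):      # sweep of lambda_T / lambda_U
    lambda_t = ratio * lambda_u
    seu = seu_rate_fit(lambda_u, 1000)
    set_ = set_rate_fit(lambda_t, 500)
    print(f"lambda_T/lambda_U={ratio}: SEU {seu} FIT, SET {set_} FIT")
```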

4.1.2. Discussions on reliability in four operation modes

The permanent error rates of four operation modes are evaluated. Supposing this device will be utilized for aerospace applications, we assume an SEU rate on the satellite orbit [7], and \( \lambda_U = 2.0 \text{ FIT}^1 \) is used for the evaluation. As for SET rate \( \lambda_T \), it is difficult to choose an appropriate value, and hence we evaluated the permanent error rate with various \( \lambda_T \). Figure 5 shows the results of using a 100MHz clock for the configuration memory and EMs for the ALU cluster.

The permanent error rate of ALU cluster in TMR mode is about \( 10^{-16} \text{ FIT} \), and high reliability is attained. The permanent error rate of DMR mode depends on \( \lambda_T / \lambda_U \), because an SET in EM is detectable but is not correctable. When SET rate \( \lambda_T \) is much smaller than \( \lambda_U \), DMR provides moderate reliability level between TMR and SMS. On the other hand, when \( \lambda_T \) is comparable to \( \lambda_U \), the permanent error rate of DMR is close to that of SMS. When we use DMR, SET rate should be carefully examined.

The permanent error rate of SMS is higher than those of DMR and TMR, as expected. At first glance, SMS and SMM appear to have quite similar reliability, because all errors are treated equally in this evaluation. However, in SMM mode the configuration information is not protected, and the functionality of the circuit is destroyed by an SEU/SET, whereas the configuration memory is protected in SMS. Thus, the reliability of SMS and that of SMM differ.

4.2. Aging process

Aging mitigation by hot-swapping heavily depends on the swapping cycle time \( T_h \). On the other hand, power consumption due to power gating for recovery is also dependent on \( T_h \). We evaluated the worst \( V_{th} \) shift using the NBTI long-term prediction model[5] assuming that all PMOSs in a cell are under stress when the cell is active. In TMR mode, stress phase remains for 3\( T_h \) followed by recovery phase of \( T_h \). In DMR mode, stress and recovery phases change alternately in \( T_h \). We gave the estimated \( V_{th} \) shift to a commercial transistor-level static timing analyzer and evaluated the increase in circuit delay. We also evaluated power dissipation due to power gating by circuit simulation. Fig. 6 illustrates the circuit delay degradation after ten years and the power consumption as a function of \( T_h \). By assigning \( T_h \) to values from \( 10^{-3} \) sec to \( 10^{-4} \) sec, the delay degradation is reduced to \( 6.2\% \) within \( 2\% \) power overhead, whereas the delay degradation would reach up to \( 35\% \) without hot-swapping.

Figure 7 illustrates the circuit delay degradation for \( T_h = 10^{-4} \) sec. If we define the lifetime as the time at which the circuit delay has degraded by \( 10\% \), the lifetime of a cluster without hot-swapping is less than 1 year, whereas the lifetimes in TMR and DMR modes with hot-swapping exceed 10 years. This clearly demonstrates that the hot-swapping operation can effectively mitigate circuit aging.

4.3. Area overhead

We show the area overhead that is introduced to attain immunity to soft errors and realize four operation modes. To analyze the overhead quantitatively, we compared the number of gates of the proposed architecture with that of a baseline architecture that contains the minimum hardware needed to perform dynamic reconfiguration properly but is not immune to SEU and SET. The baseline architecture is assumed to have CFGs, the part of the CFGSM used for context selection, EMs/RMs, and interconnect. In contrast, the proposed architecture includes CFGs, VCs, EMs/RMs, RCU, CVU, and interconnect. The gate counts of both the proposed architecture and the baseline architecture are estimated from an RTL design with an industrial 90nm cell library and Synopsys Design Compiler.

---

1 FIT = \( 1 \times 10^{-9} \text{ error/hour} \)

The gate counts of units in each cluster are listed in Table 3, where $n=8$-bit. The area of additional circuits to provide flexible dependability occupies 21.0% to 30.5% of the total area and the average is 26.6%. Most of area overhead arises from voters for configuration memory (VCs), and the other part is limited. On the other hand, the overhead varies depending on the data width of the architecture. When $n=16$ and 32, the area overhead of ALU cluster is reduced to 25.6% and 19.7%, respectively.

Table 3. Area overhead ($n=8$).

<table>
<thead>
<tr>
<th rowspan="2">Block Name</th>
<th colspan="2">ALU Cluster</th>
<th>Mult. Cluster</th>
</tr>
<tr>
<th>Area</th>
<th>Overhead</th>
<th>Area</th>
</tr>
</thead>
<tbody>
<tr>
<td>CVU</td>
<td>378</td>
<td>378</td>
<td>684</td>
</tr>
<tr>
<td>RDU</td>
<td>194</td>
<td>194</td>
<td>194</td>
</tr>
<tr>
<td>CFGSM</td>
<td>2,053</td>
<td>793</td>
<td>1,770</td>
</tr>
<tr>
<td>EMs/RMs</td>
<td>3,054</td>
<td>196</td>
<td>4,414</td>
</tr>
<tr>
<td>CFGs</td>
<td>6,048</td>
<td>-</td>
<td>5,184</td>
</tr>
<tr>
<td>VCs</td>
<td>4,186</td>
<td>4,186</td>
<td>3,611</td>
</tr>
<tr>
<td>Interconnect</td>
<td>2,955</td>
<td>-</td>
<td>3,446</td>
</tr>
<tr>
<td>Total</td>
<td>18,868</td>
<td>5,747</td>
<td>19,303</td>
</tr>
</tbody>
</table>
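
The 30.5% upper bound quoted above can be reproduced from the table with a few lines of Python. The sketch reads the first two numeric columns as the area and the overhead of the ALU cluster, which is an assumption about the column layout, and treats the "-" entries as zero overhead.

```python
# (area, overhead) in gates, taken from the first two numeric columns.
alu_cluster = {
    "CVU": (378, 378),
    "RDU": (194, 194),
    "CFGSM": (2053, 793),
    "EMs/RMs": (3054, 196),
    "CFGs": (6048, 0),
    "VCs": (4186, 4186),
    "Interconnect": (2955, 0),
}

total_area = sum(area for area, _ in alu_cluster.values())
total_overhead = sum(overhead for _, overhead in alu_cluster.values())
overhead_percent = 100.0 * total_overhead / total_area
vc_share = 100.0 * alu_cluster["VCs"][1] / total_overhead

print(f"area {total_area}, overhead {total_overhead} "
      f"({overhead_percent:.1f}% of area, VCs are {vc_share:.1f}% of overhead)")
```

The sums match the last row of the table, and the voters for the configuration memory account for roughly three quarters of the added gates in this column.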

As a sample application, a Viterbi decoder (constraint length 3) was manually mapped onto the cluster array as described below. As illustrated in Fig. 8, the process of Viterbi decoding is divided into three parts: branch metric unit, path metric unit, and path memory unit. Each unit is implemented in two ways: in one, all clusters of the unit are configured in TMR mode (denoted by T); in the other, they are configured in SMM mode (denoted by S). Combining these three units, eight patterns of Viterbi decoders are obtained. A pattern is denoted by three characters (e.g. T-S-T), where the first, second, and third characters correspond to the mode of the branch metric unit, path metric unit, and path memory unit, respectively.
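
The eight patterns can be enumerated mechanically; the short Python sketch below does so and decodes a pattern string back into per-unit modes. The helper name `describe` is hypothetical.

```python
from itertools import product

UNITS = ("branch metric unit", "path metric unit", "path memory unit")
MODE_NAMES = {"T": "TMR", "S": "SMM"}

# Every combination of T (TMR) and S (SMM) over the three units.
patterns = ["-".join(p) for p in product("TS", repeat=len(UNITS))]


def describe(pattern: str) -> dict[str, str]:
    """Map each unit to the operation mode encoded in a pattern string."""
    return {unit: MODE_NAMES[code]
            for unit, code in zip(UNITS, pattern.split("-"))}


print(patterns)
print(describe("T-S-T"))
```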

Figure 9 describes the number of required clusters and the number of sensitive bits in each configuration pattern. In “S-S-S”, the number of sensitive bits is 790 with 35 clusters, while in “T-T-T” the number of sensitive bits is 0 with 70 clusters. Therefore, there is a considerable trade-off between area and reliability. This evaluation result suggests that configuring the operation mode of each cluster individually can improve the area-reliability trade-off.

Fig. 6. Trade-off between aging mitigation and power consumption (after 10 years).

Fig. 7. Comparison of circuit delay degradations.

Fig. 8. Overview of Viterbi Decoder.

Fig. 9. Area-reliability trade-off of the Viterbi Decoder.

6. CONCLUSIONS

A coarse-grained dynamically reconfigurable architecture, in which four operation modes with different reliability and area efficiency can be selected for each cluster, has been proposed. Evaluation results of permanent error rates show that the proposed reconfigurable architecture can realize flexible reliability against soft errors through four operation modes. Evaluation results of the aging process show that the circuit delay degradation can be mitigated by the hot-swapping operation with a power increase of 2%. The area overhead needed to attain this mitigation and provide flexible reliability accounts for 26.6% of the proposed coarse-grained dynamically reconfigurable device. In addition, a fault-tolerance evaluation based on the sensitive bits of a Viterbi decoder suggests that the variation in the number of sensitive bits among clusters could be utilized to improve the trade-off between reliability and area overhead. As future work, we will evaluate the performance impact of applications implemented on the reconfigurable architecture with different reliability levels.

7. ACKNOWLEDGMENT

The authors would like to thank the project members of JST CREST of Kyoto University, Kyoto Institute of Technology, Nara Institute of Science and Technology, and ASTEM RI for their discussions.

8. REFERENCES


