# A Power and Delay Optimization Method Using Input Reordering in Cell-Based CMOS Circuits

Masanori HASHIMOTO<sup> $\dagger$ </sup>, Student Member, Hidetoshi ONODERA<sup> $\dagger$ </sup>, and Keikichi TAMARU<sup> $\dagger$ </sup>, Members

SUMMARY We present a method for power and delay optimization by input reordering. We observe that the reordering has a significant effect on the power dissipation of the gate which drives the reordered gate. This is because the input capacitance depends on the signal values of other inputs. This property, however, has not been utilized for power reduction. Previous approaches focus on the reduction of the power dissipated by internal capacitances of the reordered gate. We propose a heuristic algorithm considering the total power consumed in the driving gate and the reordered gate. Experimental results using 30 benchmark circuits show that our method reduces the power dissipation in all the circuits by 5.9% on average. There is a possibility that power dissipation is reduced by 22.5% maximum. In the case of delay and power optimization, our method reduces delay by 6.7% and power dissipation by 5.3% on average.

 ${\it key \ words:}$  input reordering, transistor reordering, power estimation

## 1. Introduction

Reducing power dissipation is one of the most principal subjects in VLSI design today. The main driving factor is that the chips used in portable environment are increasing drastically. In addition, there are many other factors that require low power strongly, such as reducing package costs, maintaining high reliability. In the various stages of the VLSI design, many techniques for power reduction have been proposed, such as supply-voltage scaling [1], [2], technology mapping for low power [3], gate sizing [4], input reordering[5]-[9], and so on. The technique of input reordering has two advantages. The first advantage is that input reordering has little effect on the layout area. The second is that other techniques can be combined easily with input reordering. In [5], the authors proposed that an input with high switching probability should be connected with a pin which has small input capacitance. Here, a small input capacitance means that the size of the input transistor is small. However, this strategy is not effective in cell-based design because the pins that are equivalent logically have the same transistor size in most standard cell libraries. In [6]–[9], the authors discussed input reordering for power reduction such that the reordering reduces the power dissipation inside the

reordered gate. The input reordering, however, affects not only the power dissipated inside the reordered gate but also the power dissipated by a fan-in gate and fanout gates, where the fan-in gate is the gate which drives the reordered gate and the fan-out gates are the gates driven by the reordered gate. The effect of input reordering appeared in the fan-in gate comes from the fact that the input capacitance of the reordered gate differs depending on the signal values of other inputs, as we will explain later in detail. As a result, dynamic power dissipation of the fan-in gates changes according to the input reordering of the reordered gate. The variation of input capacitances has not been utilized for power reduction previously. In this paper, we discuss the effects of input reordering on power dissipation in the fan-in gate and the fan-out gates as well as in the reordered gate, and propose an improved method for power optimization which exploits the effects.

This paper is organized as follows. Section 2 discusses the effects of input reordering on power dissipation and delay. Section 3 discusses strategies of input reordering for power and delay optimization at each gate. Section 4 introduces an algorithm of input reordering for power and delay optimization for the whole circuit. Section 5 shows the experimental results of our method. Finally Sect. 6 concludes the discussion.

## 2. The Effects on Power Dissipation and Delay

In this section, we discuss the major effects of input reordering in the fan-in gate, the reordered gate and the fan-out gate. So far, only the effect for the reordered gate has been considered for performance optimization. We show that there is a notable effect of input reordering on power dissipation in the fan-in gate, which could be utilized for performance optimization as well.

#### 2.1 Fan-in Gate

The dynamic power dissipated by a fan-in gate varies by the input reordering of the reordered gate (the gate that the fan-in gate drives). This is because the input capacitance of the reordered gate, i.e. the load capacitance of the fan-in gate, depends on the signal values of other inputs of the reordered gate. We demonstrate the difference numerically using an example from a real

Manuscript received July 1, 1998.

Manuscript revised September 24, 1998.

<sup>&</sup>lt;sup>†</sup>The authors are with Department of Communications and Computer Engineering, Kyoto University, Kyoto-shi, 606–8501 Japan.

Fig. 1 The effect of input reordering in a 2-input NAND gate.

 $0.7 \,\mu\text{m}$  standard cell library. Figure 1 shows a 2-input NAND gate with inputs A and B, two nMOSFETs NA and NB in series, being NA closer to the output. When the input B keeps low, the input capacitance of A is 25 fF. When the input B keeps high, the input capacitance of A becomes 41 fF which is 64% larger than the previous case. The difference of the input capacitance (16 fF) is larger than the internal capacitance ( $C_B = 11 \,\text{fF}$ ) which is the sum of the diffusion capacitances of the source (NA) and the drain (NB).

The input capacitance of A depends on whether the source of NA is connecting to ground or not. Let us show that the input capacitance is small when the source of the input transistor is floating from ground, using a 3-input NAND gate (Fig. 2) as an example. Figure 2 shows the method of measuring the input capacitance. A current meter *i* is added to measure the current poured into the input capacitance of B. The voltage source  $V_{in}$  generates a ramp waveform changing from 0 to  $V_{DD}$ . The integration of the current yields the charge  $Q_B$  poured into the input B. The input capacitance of B,  $C_B$ , is calculated as

$$C_B = \frac{Q_B}{V_{DD}}.$$
(1)

Table 1 lists  $C_B$  under various conditions of other inputs and the initial voltage of internal nodes. The rightmost column (Ratio) indicates the ratio of the input capacitance under various conditions with respect to the value when both of inputs A and B are kept high. From Table 1, we can observe that the signal value of the input C affects  $C_B$  strongly. In other words,  $C_B$ becomes small when the source of NB is floating from ground. Compared with the input C, the input A and the initial value of internal nodes have minor influence. Therefore we characterize the input capacitance under following two conditions; the condition that the source



 Table 1
 Input capacitance of B under various conditions.

| Input A                                                     | Input C | Node $n_1$                      | Node $n_2$       | Input Capa- | Ratio |  |
|-------------------------------------------------------------|---------|---------------------------------|------------------|-------------|-------|--|
|                                                             |         |                                 |                  | citance(fF) | (%)   |  |
| High                                                        | High    | -                               | -                | 40          | -     |  |
| Low                                                         | High    | $High^{\dagger}$                | -                | 38          | 95    |  |
| Low                                                         | High    | Low                             | -                | 37          | 93    |  |
| High                                                        | Low     | -                               | $High^{\dagger}$ | 25          | 63    |  |
| High                                                        | Low     | -                               | Low              | 24          | 60    |  |
| Low                                                         | Low     | $High^{\dagger}$                | $High^{\dagger}$ | 24          | 60    |  |
| Low                                                         | Low     | $\operatorname{High}^{\dagger}$ | Low              | 26          | 65    |  |
| Low                                                         | Low     | Low                             | Low              | 26          | 65    |  |
| High: $V_{DD}$ High <sup>†</sup> : $V_{DD} - V_{TH}$ Low: 0 |         |                                 |                  |             |       |  |

of the input transistor is connecting to ground, and the condition that the source is floating ground.

#### 2.2 Reordered Gate

Internal capacitances in a reordered gate have an influence on the power dissipation, delay time, and transition time of the reordered gate. References [6]-[9] discuss methods for power reduction by input reordering such that the number of charging and discharging the internal capacitances could be reduced. Let us take a 4-input NAND gate as an example to investigate how power dissipation and delay vary input by input. Table 2 lists the power dissipation (dissipated energy, rigorously), rise/fall delay times and transition times when the output load capacitance is 60 fF and the transition time of the input signal is 0.4 ns. The gate is driven by input A or D, where input A is closest to the output and input D is closest to ground. The dissipated power (energy), rise delay time and rise transition time of input D are larger than those of input A by 79%, 69%, 78%, respectively.

Rise delay/transition times as well as fall delay/transition times show input pin dependencies. Even in transitions driven by a parallel-connected transistor (eg. output rise/fall for NAND/NOR gates), there exists the distinct input-pin dependency. This





|                          | Pin A | Pin D | Pin D/Pin A |
|--------------------------|-------|-------|-------------|
| Power(pJ)                | 3.8   | 6.8   | 179%        |
| Rise Delay(ns)           | 0.26  | 0.44  | 169%        |
| Fall Delay(ns)           | 0.18  | 0.23  | 128%        |
| Rise Transition Time(ns) | 0.41  | 0.73  | 178%        |
| Fall Transition Time(ns) | 0.34  | 0.33  | 97%         |

 Table 2
 Typical characteristics of a 4-input NAND gate.

is because the amount of capacitances to be charged, which includes the internal capacitances between seriesconnected MOSFETs, depends on the location of the driving (input) pin. In Ref. [10], the input-pin dependency of the transitions driven by a parallel-connected transistor is neglected in its timing optimization process. The above example means that this simplification is not reasonable. The dependencies in both transitions make a delay optimization process not so straightforward, as described later.

## 2.3 Fan-out Gate

Input reordering of the reordered gate affects the power dissipation of a fan-out gate. This is because the reordering changes the transition time of the input signal of the fan-out gate, which leads to the change in the short-circuit current of the fan-out gate. If the transition time is short, the short-circuit power dissipation in the fan-out gate becomes small. This effect, however, is secondary compared to those of the fain-in gate and the reordered gate. We therefore do not have a further discussion on fan-out gates.

#### 3. Reordering Strategies

In this section, we discuss the reordering strategies for each effect discussed in the previous section. The overall algorithm which combines the strategies for optimizing the total performance of the circuits will be shown in the next section.

#### 3.1 Definitions

We define the primary input signal x[n], a synchronized discrete-time logic signal as

$$x[n] = x(nT) = x(t)|_{t=nT},$$
 (2)

where n is an integer and T is the period of the system clock. The signal probability P(x) and transition rate R(x) are defined as follows.

$$P(x) = \lim_{k \to \infty} \frac{1}{k} \sum_{n=1}^{k} x[n].$$
 (3)

$$R(x) = \lim_{t \to \infty} \frac{n_x(t)}{t},\tag{4}$$



where  $n_x(t)$  is the number of transitions of x(t) between a time interval of length t, and  $n_x(t)$  includes glitch transitions.

We represent a static CMOS gate as a directed acyclic graph (V, E)[7].  $V = \{n_0, \cdots, n_{p-1}, y, vdd,$ gnd is the set of nodes, where  $(n_0, \dots, n_{n-1})$  are the internal nodes of the gate, (y) is the output node and (vdd, gnd) are the power and ground nodes. E represents the 2q transistors (q of pMOS and q of nMOS) which connect the nodes in V. Each edge has a label representing the logical condition that the transistor corresponding to the edge is conductive. The graph of AOI31 gate (Fig. 3) is represented as Fig. 4. We define the boolean function  $H_{n_k}$  that represents a logical sum of all possible paths from vdd to  $n_k$ , where each path is represented as a logical product of the label of the edges on the path. In the example of AOI31 gate,  $H_y$ is represented as  $(\overline{A} + \overline{B} + \overline{C}) \cdot \overline{D}$ . Similarly  $G_{n_k}$  is the boolean function that represents all possible paths from  $n_k$  to gnd. Boolean function  $K_{n_k \to n_l}$  represents all possible paths from  $n_k$  to  $n_l$ .

## 3.2 Power Dissipation in Fan-in Gate

We explain the strategy for reducing the power dissipated in fan-in gates. To consider the effect that the input capacitance depends on other inputs, we introduce effective input capacitance as an integral average of the input capacitance. The input reordering changes the effective input capacitance. Therefore, if the input with high transition rate have smaller effective input capacitance, the power dissipation in the fan-in gate becomes smaller.

In Sect. 2.1, we say that the input capacitance becomes small when the source of the input n(p)transistor is floating from ground (power supply), which is not accurate for a complex gate. We take the input capacitance A of AOI31 gate (Fig. 3) as an example. Suppose the inputs (B, C, D) are (0, 1, 1). The source of transistor NA is not connected to ground through the transistors NB and NC. The input capacitance of A, however, does not look small. It is because the input transistor NA is connecting to ground through the transistor ND. Therefore in the case of nMOS, the input capacitance looks small when both the source and the drain are floating from ground. Similarly, in the case of pMOS, the input capacitance looks small when both the source and the drain are floating from power supply.

Now, we explain the calculation of the effective input capacitance. We define the boolean function  $GI_{X_i}$ which represents the logical condition that the source of NX<sub>i</sub> is connecting to ground, where NX<sub>i</sub> is the ntransistor of input  $X_i$ .

$$GI_{X_i} = G_{n_k},\tag{5}$$

where  $n_k$  is the node that corresponds to the source of NX<sub>i</sub>. Also we define the boolean function  $GI'_{X_i}$  which represents the logical condition that the drain of NX<sub>i</sub> is connecting to ground when NX<sub>i</sub> is not conductive.

$$GI'_{X_i} = \sum_{n_j \in V_n} K_{n_j \to n_l} \cdot G_{n_j}|_{X_i=0}, \tag{6}$$

where  $n_l$  is the node that corresponds to the drain of NX<sub>i</sub> and  $\sum$  represents the boolean OR operation.  $V_n$  is a subset of V which consists of node (y) and all the nodes in the nMOS network.  $(K_{n_j \to n_l} \cdot G_{n_j}|_{X_i=0})$  means the logical condition that node  $n_l$  is connected to ground via node  $n_j$  when NX<sub>i</sub> is not conductive. Using (5) and (6), the boolean function  $FG_{X_i}$ , which represents the condition that both the source and the drain of NX<sub>i</sub> are floating from ground, is represented as follows.

$$FG_{X_i} = \overline{GI_{X_i}} \cdot \overline{GI'_{X_i}}.$$
(7)

Similarly, we define the boolean functions  $HI_{X_i}$ and  $HI'_{X_i}$ .

$$HI_{X_i} = H_{n_m},\tag{8}$$

where  $n_m$  is the node that corresponds to the source of  $PX_i$ .  $PX_i$  is the p-transistor of input  $X_i$ .

$$HI'_{X_{i}} = \sum_{n_{j} \in V_{p}} K_{n_{n} \to n_{j}} \cdot H_{n_{j}}|_{X_{i}=1},$$
(9)

where  $n_n$  is the node that corresponds to the drain of PX<sub>i</sub>.  $V_p$  is a subset of V which consists of node (y) and all the nodes in the pMOS network. The boolean function  $FH_{X_i}$ , which represents the condition that both the source and the drain of PX<sub>i</sub> are floating from power supply, is represented as follows.

$$FH_{X_i} = \overline{HI_{X_i}} \cdot \overline{HI'_{X_i}}.$$
(10)

Using (7) and (10), the effective input capacitance of  $X_i$  is represented as follows.

$$C_{X_i}^{eff} = C_{PX_i} P(\overline{FH_{X_i}}) + C_{PX_i}^{float} P(FH_{X_i}) + C_{NX_i} P(\overline{FG_{X_i}}) + C_{NX_i}^{float} P(FG_{X_i}), (11)$$

where  $C_{PX_i}^{float}(C_{NX_i}^{float})$  is the gate capacitance of  $PX_i(NX_i)$  when the source and the drain are floating from power supply (ground), and  $C_{PX_i}(C_{NX_i})$  is the gate capacitance when the drain is connecting to power supply (ground).

In the case of input B of AOI31 gate (Fig. 4),  $GI_B$ ,  $GI'_B$ ,  $FG_B$ ,  $HI_B$ ,  $HI'_B$  and  $FH_B$  are represented as follows.

$$GI_B = C, (12)$$

$$GI'_B = AD + 1 \cdot 0 + 0 \cdot C = AD, \tag{13}$$

$$FG_B = \overline{C} \cdot \overline{AD},\tag{14}$$

$$HI_B = 1, (15)$$

$$HI'_B = \overline{D} \cdot (\overline{A} + \overline{C}) + (\overline{A} + \overline{C}) = \overline{A} + \overline{C}, \qquad (16)$$

$$FH_B = \overline{1} \cdot \overline{A} + \overline{C} = 0. \tag{17}$$

The effective input capacitance of B  $(C_B^{eff})$  is represented as follows.

H

$$C_B^{eff} = C_{PB} + C_{NB}P(\overline{\overline{C} \cdot \overline{AD}}) + C_{NB}^{float}P(\overline{C} \cdot \overline{AD}).$$
(18)

From the above discussion, we should reorder the inputs so as to decrease  $PW_{input}$  (the sum of the power dissipation of charging the input capacitances).

$$PW_{input} = \frac{1}{2} V_{DD}^2 \sum_{i=0}^{n-1} R(X_i) C_{X_i}^{eff},$$
(19)

where n is the number of inputs of the reordered gate.

#### 3.3 Power Dissipation in the Reordered Gate

CMOS complementary gates consist of series/parallelconnected MOSFETs. The internal capacitances between series-connected MOSFETs influence on power dissipation in the reordered gate. An effective estimation method of the number of transitions at each internal node is proposed [7]. We utilize the method for the power estimation of the reordered gate. Here we explain the method briefly according to Ref. [7].

The power consumption of node  $n_k$  produced by input  $X_i$  ( $W_{n_k}|_{X_i}$ ) is represented as follows.

$$W_{n_k}|_{X_i} = \frac{1}{2} C_{n_k} V_{DD} (V_{DD} - V_{TH}) \cdot R(n_k)|_{X_i},$$
(20)

where  $C_{n_k}$  is the internal capacitance corresponding to node  $n_k$ .  $R(n_k)|_{X_i}$  is the transition rate of the transitions caused by the input  $X_i$  at the node  $n_k$ . If there are no simultaneous transitions,  $R(n_k)|_{X_i}$  is represented as follows.

$$R(n_k)|_{X_i} = R(X_i) \{ P(\frac{\partial H_{n_k}}{\partial X_i}) \overline{P(n_k)} + P(\frac{\partial G_{n_k}}{\partial X_i}) P(n_k) \}.$$
(21)

**Table 3**Delay time of a 4-input NAND gate.

|                | Pin A | Pin D | Pin D/Pin A |
|----------------|-------|-------|-------------|
| Rise Delay(ns) | 0.51  | 0.73  | 143%        |
| Fall Delay(ns) | 0.16  | 0.12  | 75%         |

The power dissipation of the reordered gate  $(PW_{reordered})$  is represented as follows.

$$PW_{reordered} = \sum_{k=0}^{p-1} (\sum_{i=0}^{n-1} W_{n_k}|_{X_i}) + \frac{1}{2} C_{load} V_{DD}^2 R(Y), \qquad (22)$$

where  $C_{load}$  is the load capacitance and p is the number of internal nodes and n is the number of inputs of the reordered gate. Thus we should reorder the inputs so as to decrease  $PW_{reordered}$ .

#### 3.4 Delay

The delay of a gate differs not only input by input but also by the direction of output transition (rise/fall). Even in transitions driven by a parallel-connected transistor (eg. output rise/fall for NAND/NOR gates), there exists the input-pin dependency as seen in Table 2. Also the fall/rise delay of the pin with the smallest rise/fall delay is not necessarily the smallest. Table 3 shows the delay time of 4-input NAND gate when the output load capacitance is 60fF and the transition time of the input signal is 1.5ns. Table 3 is different from Table 2 in the condition of the input transition time. The rise delay of input A is smaller than the rise delay of D. However the fall delay of A is larger than the fall delay of D. We therefore need to consider both rise and fall pin-to-pin delays for each input instead of reducing them to a single pin-to-pin delay as is done in conventional timing optimization approaches [10], [11]. This implies that we should associate two delays (fall and rise delays) with each output.

In order to evaluate the contribution of each delay to the overall circuit delay, we calculate two  $slacks(rise\_slack, fall\_slack)$  at each output and use them as the measure of delay, where the slack is defined as the difference between the required arrival time and the latest arrival time [12]. For the delay optimization, we use the input order which makes  $min(rise\_slack, fall\_slack)$  the largest. This strategy is greedy to minimize the delay, and does not increase the delay of a critical path.

## 4. Optimization Algorithm

In the previous section, we present two strategies for power reduction and one strategy for delay reduction. In this section, we discuss an algorithm which combines the three strategies for the total performance optimization of the whole circuit.



Fig. 5 Optimization algorithm in each gate.

#### 4.1 Optimization in Each Gate

Here we show an algorithm which optimizes a gate considering three strategies, that is to say, the strategy for the power reduction in the fan-in gates, the power reduction in the reordered gate, and the delay optimization for each gate.

First, we explain how to combine the power reduction strategy for the fan-in gates with that for the reordered gate. Using two estimated power dissipations ( $PW_{input}$  and  $PW_{reordered}$ ), we consider that the input ordering which minimizes the total power  $PW(=PW_{input} + PW_{reordered})$  is the best for low power. We try all the permutations and choose the one with the smallest PW. If the delay constraint is imposed on the power optimization, we calculate the slack for each permutation and select the ordering with the smallest PW and positive slack. This flow is shown in Fig. 5 (a).

In the case of delay optimization, we calculate min(*rise\_slack*, *fall\_slack*) for all the permutations and select the best order. This flow is shown in Fig. 5 (b).

#### 4.2 Optimization of the Whole Circuit

For delay optimization, each gate is reordered with the strategy in Sect. 4.1, in a breadth-first search order starting from a gate with all the inputs driven by primary inputs. A reordering of a certain gate may change the slack of a gate not only in the fan-out direction but also in the fan-in direction. It is because that the required time of a gate in the fan-in direction changes and the rise slack and the fall slack of the gate change accordingly. So, the order which has been processed previously is not necessarily the best order. Even if all the gates in the circuit have been reordered once, there is a possibility that further delay reduction can be achieved. Therefore, the delay optimization requires iterative optimization. The delay optimization loop finishes when the delay of critical path can not be decreased. In the case of power optimization, we apply

|          |         | Power Dissipation Delay |                |       | CPU     | NO      |       |      |
|----------|---------|-------------------------|----------------|-------|---------|---------|-------|------|
| Circuit  | Initial | Reduct                  | ion(%)         | Diff. | Initial | Reduc-  | Time  | of   |
|          | (mW)    | ΑŤ                      | B <sup>‡</sup> | (%)   | (ns)    | tion(%) | (s)   | Gate |
| sao2     | 2.69    | 3.5                     | 0.4            | 13.0  | 4.25    | 3.2     | 0.7   | 100  |
| mv_adder | 4.60    | 6.3                     | 6.2            | 5.8   | 11.8    | 0.1     | 42.1  | 112  |
| c432     | 5.57    | 12.9                    | 9.9            | 19.4  | 10.2    | 4.0     | 6.2   | 112  |
| apex7    | 4.79    | 5.6                     | 3.8            | 9.9   | 3.90    | 3.3     | 0.7   | 135  |
| clip     | 5.75    | 3.6                     | 2.4            | 8.4   | 4.45    | 0.2     | 0.7   | 154  |
| term1    | 5.94    | 4.5                     | 1.6            | 7.2   | 3.76    | -2.0    | 0.7   | 171  |
| example2 | 4.46    | 5.2                     | 1.6            | 22.3  | 4.02    | 3.5     | 1.0   | 172  |
| c499     | 3.92    | 1.9                     | 0.3            | 10.9  | 4.63    | 0.7     | 55.7  | 176  |
| alu2     | 8.87    | 7.4                     | 6.5            | 9.1   | 10.5    | -0.6    | 1.3   | 197  |
| x4       | 7.30    | 9.3                     | 10.3           | 22.5  | 3.77    | -0.7    | 2.9   | 201  |
| dule2    | 4.00    | 4.4                     | 0.7            | 17.1  | 5.08    | 3.9     | 0.8   | 210  |
| c1908    | 8.48    | 10.2                    | 8.0            | 20.9  | 10.1    | 3.3     | 503.2 | 249  |
| i9       | 17.8    | 6.1                     | 6.8            | 7.1   | 4.19    | 1.5     | 0.5   | 306  |
| i7       | 15.9    | 5.3                     | 5.4            | 7.8   | 3.61    | -1.1    | 0.5   | 314  |
| c1355    | 12.5    | 7.4                     | 4.7            | 10.9  | 8.48    | 3.6     | 160.2 | 326  |
| e64      | 4.84    | 5.1                     | -1.1           | 12.5  | 4.42    | -13.7   | 0.8   | 327  |
| table5   | 5.38    | 7.7                     | 0.4            | 19.5  | 6.22    | 3.3     | 1.2   | 383  |
| apex6    | 15.0    | 5.7                     | 6.5            | 10.9  | 4.44    | 3.9     | 7.1   | 391  |
| dalu     | 15.0    | 7.8                     | 6.0            | 15.0  | 7.58    | -3.6    | 6.6   | 407  |
| x3       | 14.5    | 2.5                     | 1.3            | 5.3   | 3.26    | -4.1    | 0.7   | 447  |
| table3   | 5.71    | 7.3                     | 0.0            | 17.3  | 6.22    | 1.0     | 1.1   | 454  |
| frg2     | 16.0    | 4.4                     | 2.5            | 8.8   | 5.74    | -0.4    | 2.3   | 476  |
| i8       | 25.8    | 7.8                     | 7.4            | 7.2   | 7.71    | 1.7     | 5.0   | 535  |
| c3540    | 29.1    | 3.4                     | 2.4            | 7.0   | 12.0    | 0.8     | 9.5   | 585  |
| apex3    | 10.8    | 7.2                     | 2.8            | 15.3  | 6.68    | 1.0     | 1.3   | 734  |
| ex5p     | 14.2    | 5.6                     | 5.6            | 10.4  | 7.12    | 5.7     | 1.1   | 935  |
| alu4     | 29.0    | 3.5                     | 0.6            | 10.7  | 6.91    | 2.4     | 5.8   | 937  |
| apex2    | 25.6    | 1.7                     | -0.7           | 13.6  | 8.09    | -2.5    | 16.8  | 1253 |
| seq      | 27.7    | 5.1                     | 1.5            | 13.3  | 8.30    | 5.1     | 6.1   | 1370 |
| des      | 73.9    | 7.5                     | 4.5            | 13.0  | 7.64    | 3.4     | 29.8  | 1718 |
| Average  | -       | 5.9                     | 3.6            | 12.4  | -       | 0.9     | -     | -    |

Table 4 Power optimization without delay optimization.

 $A^{\dagger}$ : Proposed Method  $B^{\ddagger}$ : Conventional Method

the algorithm in Sect. 4.1 to each gate once, assuming input reordering does not change the transition rate. In the case of delay and power optimization, delay optimization is executed first for minimizing the critical path delay. After that, power optimization is processed under the delay constraint as shown in Fig. 5 (a).

## 5. Experimental Results

In this section, we show the results of performance optimization by input reordering. All of experiments in this section are achieved with the condition below. We use process parameters for a commercial  $0.7 \,\mu\text{m}$  process. We evaluate the power dissipation by an eventdriven transistor-level power simulator with the option that enables to consider the dependence of input capacitance. Input patterns are randomly generated with a signal probability of 0.5 and with a transition density of 0.5, where transition density represents the average number of transitions per cycle [13]. The number of applied patterns is 100, which is the adequate number for the power estimation at circuit level [14]. The circuits are operated synchronously. The cycle time of input patterns is 20ns, which is the sufficient time for all benchmark circuits to finish the behavior. The transition rate R at each gate is computed by logic simulation, and the signal probability P is calculated using SBDD (shared binary decision diagram)<sup>†</sup>. The circuits used for the experiments are taken from ISCAS85 and LGSynth93 benchmark sets (See Table 4). The circuits are synthesized and mapped by a commercial logic synthesis tool. The target library includes basic gates and complex gates and selectors. These gates are the standard cells generated by P2Lib [15].

Table 4 lists the result of power optimization without delay optimization. The columns under "Initial" show the power dissipation (delay) of the initial circuits. We reorder the circuits with the following three strategies.

- A: The strategy which considers the dissipated power in the fan-in gates and the reordered gate (proposed).
- B: The strategy which considers the dissipated power

<sup>&</sup>lt;sup>†</sup>BDD Manipulator ver 6.03 : Copyright 1992 Kyoto University (by Shin-ichi MINATO).

only in the reordered gate (equivalent to Ref. [7]).

C: The strategy which maximizes the power dissipation using the proposed method.

The columns of "A" and "B" under "Reduction" represent the percentage of the power reduction (=  $\frac{Initial-A(orB)}{Initial} \times 100(\%)$ ). The column "Diff." explains the percentage of the difference between the largest and the smallest power dissipations (=  $\frac{C-A}{A} \times 100(\%)$ ). The column "CPU Time" lists a CPU time for reordering on a Sun Ultra 2. It does not include the time to calculate transition rate by logic simulation.

From Table 4, we can see that power dissipation of all circuits is reduced by the proposed method. The "Diff." column indicates that there is a possibility of reducing power dissipation by 22.5% maximum. The proposed method (Column "A") reduces power dissipation by 5.9% on average and by 12.9% maximum, whereas a conventional method, which considers the dissipated power only in the internal capacitances of the reordered gate, reduces power dissipation by 3.6% on average and by 10.4% maximum. The power optimization without delay optimization does not affect delay so much.

In Table 5, the result of power optimization with delay optimization is shown. Our method reduces delay

Table 5Delay and power optimization.

| Circuit  | Delay                          | Power                          | Time  |
|----------|--------------------------------|--------------------------------|-------|
|          | $\operatorname{Reduction}(\%)$ | $\operatorname{Reduction}(\%)$ | (s)   |
| sao2     | 11.2                           | 2.4                            | 0.9   |
| my_adder | 0.1                            | 6.3                            | 26.0  |
| c432     | 9.9                            | 7.0                            | 20.4  |
| apex7    | 9.9                            | 3.9                            | 0.9   |
| clip     | 5.7                            | 3.1                            | 1.1   |
| term1    | 9.6                            | 2.4                            | 0.9   |
| example2 | 7.9                            | 5.2                            | 1.2   |
| c499     | 7.6                            | 0.7                            | 30.7  |
| alu2     | 4.4                            | 8.1                            | 1.5   |
| x4       | 4.4                            | 5.2                            | 2.9   |
| dule2    | 11.9                           | 3.4                            | 1.2   |
| c1908    | 8.1                            | 10.0                           | 495.3 |
| i9       | 4.4                            | 6.4                            | 0.9   |
| i7       | 4.3                            | 4.8                            | 1.1   |
| c1355    | 9.1                            | 6.5                            | 115.1 |
| e64      | 2.9                            | 4.9                            | 1.5   |
| table5   | 8.3                            | 7.3                            | 1.9   |
| apex6    | 7.0                            | 5.4                            | 10.5  |
| dalu     | 7.1                            | 8.2                            | 10.0  |
| x3       | 7.3                            | 2.9                            | 1.5   |
| table3   | 4.9                            | 7.1                            | 2.2   |
| frg2     | 4.6                            | 3.7                            | 2.6   |
| i8       | 1.9                            | 9.3                            | 7.6   |
| c3540    | 6.4                            | 2.6                            | 21.2  |
| apex3    | 4.5                            | 8.1                            | 3.1   |
| ex5p     | 8.2                            | 6.2                            | 5.2   |
| alu4     | 7.6                            | 3.1                            | 9.3   |
| apex2    | 7.8                            | 2.1                            | 21.7  |
| seq      | 6.3                            | 4.4                            | 10.6  |
| des      | 8.4                            | 7.5                            | 40.9  |
| Average  | 6.7                            | 5.3                            | -     |

by 6.7% and power dissipation by 5.3% on average.

## 6. Conclusion

We propose an improved method for power optimization of CMOS gates by input reordering. The dependence of input capacitance on the signal values of other inputs, as well as the possibility of charging/discharging internal capacitances, is utilized for the power reduction. The effect of the method is demonstrated experimentally using 30 benchmark circuits in a  $0.7 \,\mu\text{m}$ CMOS technology. The average reduction of power dissipation is 5.9%. By input reordering there is a possibility that power dissipation is reduced by 22.5% maximum. In the case of delay and power optimization, our method improves delay by 6.7% and power dissipation by 5.3% on average. Although the amount of improvement in power and delay is not drastic, input reordering can provide a steady improvement with almost zero penalty.

#### References

- A.P. Chandrakasan, S. Sheng, and R.W. Brodersen, "Lowpower CMOS digital design," IEEE Journal of Solid-State Circuits, vol.27, no.4, pp.473–484, April 1992.
- [2] K. Usami and M. Horowitz, "Clustered voltage scaling technique for low-power design," In Proceedings of IEEE/ACM International Symposium on Low Power Design, pp.3–8, 1995.
- [3] C.-Y. Tsui, M. Pedram, and A.M. Despain, "Power efficient technology decomposition and mapping under an extended power consumption model," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems vol.13, no.9, pp.1110–1122, Sept. 1994.
- [4] M. Hashimoto, H. Onodera, and K. Tamaru, "A power optimization method considering glitch reduction by gate sizing," In Proceedings of IEEE/ACM International Symposium on Low Power Electronics and Design, pp.221–226, 1998.
- [5] B. Lin and H. De Man, "Low-power driven technology mapping under timing constraints," In Proceedings of IEEE International Conference on Computer Design, pp.421–427, 1993.
- [6] R. Hossain, M. Zheng, and A. Albicki, "Reducing power dissipation in CMOS circuits by signal probability based transistor reordering," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.15, no.3, pp.361–368, March 1996.
- [7] E. Musoll and J. Cortadella, "Optimizing CMOS circuits for low power using transistor reordering," In Proceedings of European Design and Test Conference, pp.219–223, 1996.
- [8] S.C. Prasad and K. Roy, "Transistor reordering for power minimization under delay constraint," ACM Transactions on Design Automation of Electronic Systems, vol.1, no.2, pp.280–300, April 1996.
- [9] W.-Z. Shen, J.-Y. Lin, and F.-W. Wang, "Transistor reordering rules for power reduction in CMOS gates," In Proceedings of Asia and South Pacific Design Automation Conference, pp.1–6, 1995.
- [10] B.S. Carlson and S.-J. Lee, "Delay optimization of digital CMOS VLSI circuits by transistor reordering," IEEE Transactions on Computer-Aided Design of Integrated Cir-

cuits and Systems, vol.14, no.10, pp.1183–1192, Oct. 1995.

- [11] M. Marek-Sadowska and S.P. Lin, "Pin assignment for improved performance in standard cell design," In Proceedings of IEEE International Conference on Computer Design, pp.339–342, 1990.
- [12] R.B. Hitchcock, G.L. Smith, and D.D. Cheng, "Timing analysis of computer hardware," IBM Journal of Research and Development, vol.26, no.1, pp.100–105, Jan. 1982.
- [13] F.N. Najm, "Transition density, a stochastic measure of activity in digital circuits," In Proceedings of the 28th IEEE/ACM Design Automation Conference, pp.644–649, 1991.
- [14] D. Brand and C. Visweswariah, "Inaccuracies in power estimation during logic synthesis," In Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design, pp.388–394, 1996.
- [15] H. Onodera, A. Hirata, T. Kitamura, and K. Tamaru, "P2lib : Process-portable library, and its generation system," In Proceedings of IEEE Custom Integrated Circuits Conference, pp.341–344, 1997.



Keikichi Tamaru was born in Sendai, Japan in 1936. He received the B.E., M.E. and Dr. Eng. degrees in electronic engineering from Kyoto University, Kyoto Japan, in 1958, 1960 and 1970, respectively. From 1960 to 1979 he was engaged in the development of high-speed logic circuits and micro computers at Toshiba Research and Development Center, Toshiba Corporation, Kawasaki, Japan. Since 1979, he has been a Professor in

the Department of Electronics and Communication, Graduate School of Engineering, Kyoto University. His research interests include computer-aided design for digital and analog integrated circuits and the architecture of special-purpose VLSI processors. Dr. Tamaru is a member of IPSJ, IEICE, IEEE, and ACM.



Masanori Hashimoto received the B.E. degree in electronic engineering from Kyoto University, Kyoto, Japan, in 1997. Currently, he is studying toward the M.E. and Ph.D. degrees in the Department of Communications and Computer Engineering, Kyoto University. His research interest includes computer-aideddesign for digital integrated circuits.



Hidetoshi Onodera received the B.E., and M.E., and Dr. Eng. degrees in Electronic Engineering from Kyoto University, Kyoto, Japan, in 1978, 1980, 1984, respectively. Since 1983 he has been an Instructor (1983–1991) and an Associate Professor (1992–) in the Department of Electronics and Communication, Graduate School of Engineering, Kyoto University. His research interests include computer-aided-design for integrated cir-

cuits, and analog and mixed analog-digital circuits design. He is a member of the Information Processing Society of Japan, and IEEE.