Clock Skew Evaluation Considering Manufacturing Variability in Mesh-Style Clock Distribution

Shinya ABE\(^{(a)}\), Student Member, Masanori HASHIMOTO\(^{(a),†(b)}\), and Takao ONOYE\(^{(a),††(c)}\), Members

SUMMARY Influence of manufacturing variability on circuit performance has been increasing because of finer manufacturing process and lowered supply voltage. In this paper, we focus on mesh-style clock distribution which is believed to be effective for reducing clock skew, and we evaluate clock skew considering manufacturing and design variabilities. Considering MOS transistor variation — random and spatially-correlated variation — and non-uniform flip-flop (FF) placement, we demonstrate that spatially-correlated variation and severe non-uniform FF distribution can be major sources of clock skew. We also examine the dependency of clock skew on design parameters, and reveal that finer clock mesh does not necessarily reduce clock skew.

key words: mesh-style clock distribution, clock skew

1. Introduction

High-performance microprocessor design requires us to minimize clock skew aggressively. Smaller clock skew is desirable not only for shortening the clock cycle but also for reducing the delay elements inserted to satisfy hold time constraints. The clock skew constraint is often given as 5–10% of the clock cycle [1]. The obstacles to satisfying the skew constraint are (1) manufacturing variability and environmental fluctuation, such as power supply noise and temperature variation, and (2) design imperfection, such as differences in wire length between the clock source and the FFs and non-uniform FF placement. These obstacles cannot be completely eliminated, and thus we have to develop a design scheme for robust clock distribution.

Mesh-style clock distribution has been recognized as a robust structure and is used in several high-performance microprocessor designs [2], [3]. In SoC design, the small skew of the mesh structure is exploited for power reduction by using smaller clock drivers [4]. Conventionally, attention has focused on the robustness of the mesh structure against non-uniform FF placement. In the CAD area, efficient analysis methods have been studied [5], because driver outputs are shunted by the mesh and conventional gate-level timing analyzers cannot be used for clock skew analysis.

Among manufacturing variabilities, the variability of transistors is reported to be more significant than that of wires [6]. Transistor variability is often decomposed into chip-to-chip, layout-dependent, random and spatially-correlated components [7]. The chip-to-chip component affects chip performance directly; however, from the viewpoint of clock skew, its impact is secondary. The layout-dependent component can be modeled in LPE (layout parameter extraction) and cell library design, though its modeling might be expensive. The random component follows a normal distribution and affects clock skew. The spatially-correlated component is sometimes modeled such that the correlation coefficient of two devices is expressed as a function of the distance between them [8].

In this paper, we focus on a hybrid structure that consists of a tree for global distribution and a mesh for final clock distribution, and quantitatively reveal the dominant variabilities that affect clock skew. This work evaluates clock skew considering “static” variability, which includes manufacturing variability and design imperfection. The random and spatially-correlated components of transistor variability and non-uniform FF placement are considered in our analysis. We also evaluate clock skew while changing two important design parameters of the clock mesh, mesh pitch and driver strength, and examine the dependency on these design parameters.

This paper is organized as follows. Section 2 describes the features of mesh-style clock distribution. Section 3 evaluates clock skew considering manufacturing variability. We show the dominant variability sources, present an example in which a finer mesh does not necessarily minimize clock skew, and point out a possibility of low-power clock distribution. We finally conclude the discussion in Sect. 4.

2. Mesh-Style Clock Distribution

A hybrid clock distribution structure with a tree and a mesh as shown in Fig. 1 is often used in high-performance microprocessor design [2], [3]. The mesh is a grid structure, and a shunt wire connects adjacent nodes. The shunt wires reduce the clock skew that would otherwise arise without them, because early and late clock arrivals are equalized [9]. The mesh-style clock distribution thus has a property of clock skew reduction. However, it generally requires substantial wiring resources and increases wire capacitance, layout area, and power consumption. In addition, since nodes are connected to each other, conventional timing analysis cannot be applied, and design verification including clock skew analysis needs long circuit simulation.

**Fig. 1 Mesh-style clock distribution.**

The clock distribution network usually has a large load capacitance together with large wire resistance, so a number of buffers are inserted to avoid waveform distortion and to reduce the latency from the source to the FFs. A cascade driver (Fig. 2) is often used to drive a large load capacitance. The ratio $\lambda$ of output capacitance $C_{out}$ to input capacitance $C_{in}$ is typically set to between four and six [1].
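The sizing rule for a cascade driver can be illustrated with a short sketch. It is not taken from the paper: the function name `cascade_stages`, the capacitance values and the choice to round the stage count up are assumptions made only for illustration. Given an input capacitance, a load capacitance and a target $\lambda$, it picks the number of stages and the resulting per-stage ratio.

```python
import math


def cascade_stages(c_in_ff, c_load_ff, target_fanout=5.0):
    """Size a tapered (cascade) driver chain.

    Returns (input_caps, fanout): the input capacitance of every stage in fF
    and the per-stage ratio C_out / C_in actually realized.
    Hypothetical helper; the paper only states that the ratio is four to six.
    """
    if c_in_ff <= 0 or c_load_ff <= c_in_ff or target_fanout <= 1:
        raise ValueError("need 0 < c_in_ff < c_load_ff and target_fanout > 1")
    # Ideal (non-integer) stage count for the requested ratio.
    ideal = math.log(c_load_ff / c_in_ff) / math.log(target_fanout)
    # Round up so that no stage exceeds the target; the small epsilon guards
    # against floating-point noise when the ideal count is an exact integer.
    n_stages = max(1, math.ceil(ideal - 1e-9))
    # Spread the total capacitance gain evenly over the stages.
    fanout = (c_load_ff / c_in_ff) ** (1.0 / n_stages)
    input_caps = [c_in_ff * fanout ** k for k in range(n_stages)]
    return input_caps, fanout


if __name__ == "__main__":
    caps, ratio = cascade_stages(2.0, 200.0, target_fanout=5.0)
    print(len(caps), "stages, per-stage ratio", round(ratio, 2))
```

With these illustrative numbers the chain has three stages and a realized ratio inside the four-to-six range quoted above.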

**Fig. 2 Cascade driver.**

3. Clock Skew Evaluation Considering Manufacturing Variability

This section first explains the evaluation setup of the clock distribution network and the variability modeling. We then present experimental results.

3.1 Design of Clock Distribution Network and Its Modeling for Circuit Simulation

The clock distribution structure used for the experiment is the hybrid structure with an H-tree and a mesh shown in Fig. 1. We assume clock distribution within a 1 mm square clock domain in an industrial 90 nm technology. The mesh pitch is 100 $\mu$m, and the mesh is constructed with intermediate wires. The wiring material is copper, and we calculate the wire capacitance and resistance from process information provided by the foundry. The wires are modeled by a $\pi$ ladder circuit as shown in Fig. 3 for circuit simulation. For convenience, the mesh wire is segmented every 1 $\mu$m, and each segment is represented as a $\pi$ ladder circuit. The wires in the tree and the wires from the mesh to FFs are modeled by 3$\pi$ ladder circuits. The width of the mesh and tree wires is 0.64 $\mu$m, and that of the mesh-to-FF wires is 0.14 $\mu$m.
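The $\pi$ ladder wire model can be sketched in a few lines. The paper evaluates the network by circuit simulation; the Elmore delay used below is only a first-order check substituted here for illustration, and the function names and the resistance and capacitance values are hypothetical.

```python
def pi_ladder(r_total, c_total, n_sections):
    """Split a wire into n pi-sections.

    Returns (series_r, node_caps). node_caps[0] sits at the driver end and
    node_caps[-1] at the far end; each section contributes half of its
    capacitance to each of its two end nodes.
    """
    r_seg = r_total / n_sections
    c_seg = c_total / n_sections
    node_caps = [c_seg / 2] + [c_seg] * (n_sections - 1) + [c_seg / 2]
    return [r_seg] * n_sections, node_caps


def elmore_delay(series_r, node_caps, c_load=0.0):
    """First-order (Elmore) delay from the driver end to the far end."""
    delay = 0.0
    # Capacitance downstream of the first resistor: every node except node 0.
    downstream = c_load + sum(node_caps[1:])
    for i, r in enumerate(series_r):
        delay += r * downstream
        # Moving past node i + 1 removes its capacitance from the downstream set.
        downstream -= node_caps[i + 1]
    return delay


if __name__ == "__main__":
    # Hypothetical wire: 100 ohm and 200 fF in total, modeled as a 3-pi ladder.
    resistors, caps = pi_ladder(100.0, 200e-15, 3)
    print(elmore_delay(resistors, caps))
```

For an unloaded wire this model gives an Elmore delay of $RC/2$ regardless of the number of sections, so finer segmentation refines the simulated waveform rather than this first-order figure.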

**Fig. 3 $\pi$ ladder model.**

An H-tree structure, which is symmetrically branched, is used for the tree that drives the mesh. Each mesh node is driven by a final-stage driver, and the H-tree delivers the clock signal to the final-stage drivers. In this experiment, clock drivers are inserted in every two branches of the H-tree. We adopted driver sizes with $\lambda = C_{out}/C_{in} = 5$, assuming that FF placement is uniform, which means that the drivers at the same stage have the same size.

3.2 Experimental Condition

We used a transistor model developed so that the current characteristics are compatible with the prediction in the ITRS 2005 roadmap [10]. Clock skew for the falling clock edge is evaluated in a Monte Carlo manner. We prepared 300 sets of manufacturing variability information according to the variability model described below. A clock signal with 30 ps transition time is given to the source of the clock distribution network. Manufacturing variability of interconnects is not considered, because the variability of transistors is more significant than that of wires [6].

As a representative variation source of transistors, we focus on threshold voltage $V_{th}$ and vary it. Other variations are not considered in this paper. We assume that $V_{th}$ variation consists of random and spatially-correlated components. The random component is size dependent [11]; the variability is significant in a small transistor and becomes less significant as the size increases. In this analysis, to reproduce the size-dependent random component, we build a large driver by connecting a number of small unit drivers in parallel: each small unit driver has an independent random value, and these values partially cancel each other in a large driver. We assume that the spatially-correlated component has a correlation coefficient expressed as $f(x) = e^{-2x}$, where $x$ [mm] is the distance between two devices [6]. For the small unit driver, the total standard deviation of these two components is 25 mV, which is considered appropriate for 90 nm technology [6], and the two components have equal variance.
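The variability model above can be sketched as a sampling routine. This is a minimal illustration, not the authors' tool: the function names, the use of a Cholesky factor to impose the correlation $f(x) = e^{-2x}$, and the shortcut of averaging the unit-driver values (instead of simulating each parallel unit) are all assumptions. Equal variances with a 25 mV total imply $25/\sqrt{2} \approx 17.7$ mV per component.

```python
import math
import random

SIGMA_TOTAL_MV = 25.0
# Equal variances for the two components: sigma_c^2 + sigma_c^2 = 25^2.
SIGMA_COMPONENT_MV = SIGMA_TOTAL_MV / math.sqrt(2.0)


def correlation(distance_mm):
    """Correlation coefficient f(x) = exp(-2x), x in mm."""
    return math.exp(-2.0 * distance_mm)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def sample_vth_shifts(positions_mm, units_per_driver, rng):
    """One Monte Carlo sample of the Vth shift (mV) of each driver.

    positions_mm must be distinct (x, y) points; units_per_driver gives the
    number of parallel unit drivers per location (hypothetical inputs).
    """
    n = len(positions_mm)
    corr = [[correlation(math.dist(p, q)) for q in positions_mm]
            for p in positions_mm]
    lower = cholesky(corr)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    shifts = []
    for i in range(n):
        # Spatially-correlated part: correlated unit normals scaled to sigma_c.
        spatial = SIGMA_COMPONENT_MV * sum(lower[i][k] * z[k] for k in range(i + 1))
        # Random part: mean of independent unit-driver values (approximation).
        m = units_per_driver[i]
        rand = sum(rng.gauss(0.0, SIGMA_COMPONENT_MV) for _ in range(m)) / m
        shifts.append(spatial + rand)
    return shifts


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_vth_shifts([(0.0, 0.0), (0.1, 0.0), (0.9, 0.9)], [1, 4, 16], rng))
```

Averaging $m$ independent unit values reduces the random standard deviation by $\sqrt{m}$, which is the size dependence the parallel-unit construction is meant to reproduce.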

FFs are not placed uniformly, and in some cases FF densities differ greatly within a chip. Even inside a functional block, the FF placement is often far from uniform. This non-uniform FF placement has been a major source of clock skew so far, and we evaluate its importance quantitatively in comparison with manufacturing variability. We used the FF placement distributions of two actual designs (FPU [12] and MeP [13]), laid out by a commercial place-and-route tool with the 90 nm industrial standard cell library. These FF distributions are shown in Figs. 4 and 5. The numbers of FFs in FPU and MeP are 667 and 4411, respectively. The input capacitance of each FF is 1.9 fF. These locations and capacitances are scaled proportionally to a 1 mm square clock domain. To make a comparison, we also build uniform FF distributions, in which the FF load capacitance is uniformly distributed with a 40 μm resolution and the total load capacitance is identical to that of the actual layout. The FFs are connected to the nearest mesh segment (Fig. 6).
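Connecting an FF to the nearest mesh segment amounts to projecting its location onto the closest grid wire. The sketch below is one plausible way to do this for a square mesh; the function name, the coordinate convention (origin at a mesh corner, units of μm) and the tie-breaking rule are assumptions, not details given in the paper.

```python
def nearest_mesh_point(x_um, y_um, pitch_um=100.0, size_um=1000.0):
    """Project an FF location onto the nearest wire of a square mesh.

    Returns ((x, y), stub_length): the connection point on the mesh and the
    length of the mesh-to-FF wire. Assumes the FF lies inside the clock
    domain and that mesh wires run at multiples of pitch_um in both axes.
    """
    def snap(v):
        # Coordinate of the closest grid line, clamped to the domain.
        return min(max(round(v / pitch_um) * pitch_um, 0.0), size_um)

    gx, gy = snap(x_um), snap(y_um)
    dx, dy = abs(x_um - gx), abs(y_um - gy)
    if dx <= dy:
        # The vertical wire at x = gx is closest; move horizontally onto it.
        return (gx, y_um), dx
    # Otherwise the horizontal wire at y = gy is closest.
    return (x_um, gy), dy


if __name__ == "__main__":
    print(nearest_mesh_point(130.0, 260.0))
```

For an FF at (130, 260) μm with a 100 μm pitch, the vertical wire at x = 100 μm is 30 μm away and the horizontal wire at y = 300 μm is 40 μm away, so the FF attaches to the former.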

To reveal dominant variabilities, we evaluate clock skew with various combinations of these variabilities. We also examine a tree clock distribution, in which the mesh shunts are simply removed, to confirm the effectiveness of mesh in clock skew reduction. In this case, FFs are connected to the nearest final-stage driver.

3.3 Results

Figure 7 shows the skew histogram in the case that random and spatially-correlated $V_{th}$ variabilities and non-uniform FF placement of MeP are given. Figures 8–10 are the histograms excluding each of random $V_{th}$, spatially-correlated $V_{th}$, and FF placement variation, respectively. Table 1 lists the average and standard deviation of the maximum skew with various combinations of these variabilities.
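The statistic reported in Table 1 can be computed as sketched below: the maximum skew of one Monte Carlo sample is the spread between the latest and earliest FF arrival, and $\mu$ and $\sigma$ are taken over all samples. The function names and the arrival times are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the paper does not state which estimator was used.

```python
import statistics


def max_skew(arrival_times_ps):
    """Maximum skew of one sample: latest minus earliest FF arrival time."""
    return max(arrival_times_ps) - min(arrival_times_ps)


def skew_statistics(samples):
    """Mean and sample standard deviation of the max skew over all samples."""
    skews = [max_skew(arrivals) for arrivals in samples]
    return statistics.mean(skews), statistics.stdev(skews)


if __name__ == "__main__":
    # Three hypothetical Monte Carlo samples, each listing FF arrival times in ps.
    samples = [
        [100.0, 110.0, 105.0],
        [100.0, 114.0, 102.0],
        [101.0, 113.0, 107.0],
    ]
    mu, sigma = skew_statistics(samples)
    print(mu, sigma)
```

In the paper's setup each of the 300 variability sets would contribute one such list of arrival times from circuit simulation.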

Figures 7 and 8 are similar to each other, which indicates that the total skew depends strongly on the spatially-correlated variability. On the other hand, the skew histogram in Fig. 9 is very sharp, which means that the skew variation due to the random component is small. The skew variation is reduced by the mesh structure and the large drivers, in which random variability is averaged out.

Figure 11 shows an example of $V_{th}$ variation. Several transistors exist at each position, and their average is depicted. Figure 12 shows the arrival times of the clock signal at each FF when the variability of Fig. 11 is given. We can see that the arrival times are smoothed and the peaks corresponding to local $V_{th}$ variations are flattened. The difference in arrival time between adjacent nodes is averaged by the mesh shunt connection. On the other hand, the clock arrival time changes along with the global $V_{th}$ gradient, which results in a large clock skew. The mesh shunt has non-negligible resistance, and hence the effective area of the averaging effect is limited. Thus, the mesh-style clock distribution is robust to random variation but is not so tolerant of spatially-correlated variation. We therefore need not pay much attention to random variation as long as mesh-style clock distribution is used, and we should focus on how to mitigate the clock skew caused by spatially-correlated variability.

When FFs are uniformly distributed, the average clock skew becomes smaller (see Figs. 7 and 10). However, in the case of MeP, the non-uniformity of FF placement is not severe (Fig. 5), and the skew difference due to FF placement is 4 to 6 ps, which is not large (Table 1). On the other hand, in the case of the FPU design (Fig. 4), the FFs are not well distributed, and hence the non-uniform placement increases the clock skew by over 10 ps. As for the variation of the clock skew, the standard deviation is comparable for uniform and non-uniform placement in both MeP and FPU.

To clearly observe the skew reduction effect by the mesh shunt, we evaluate the clock skew for a tree-only distribution without mesh shunts considering the random and spatially-correlated $V_{th}$ variabilities. The results are also listed in Table 1. When the shunts are removed, the clock skew becomes 5 to 8 times larger, which means the shunts are very effective for designing a clock distribution with small clock skew.

3.4 Skew Evaluation under Various Mesh Pitches and Fanouts

In the previous subsection, the mesh pitch is fixed to 100 $\mu$m and the effective fanout ($\lambda = C_{out}/C_{in}$) is set to 5. In this subsection, we regard the mesh pitch and $\lambda$ as design parameters, and evaluate the clock skew varying these design parameters.

**Fig. 11 $V_{th}$ variation.**

**Fig. 12 Clock arrival time.**

**Table 1** Average $\mu$ and standard deviation $\sigma$ of max skew. † In a tree-only distribution, mesh shunts are removed.

<table>
<thead>
<tr>
<th>Core</th>
<th>Variability</th>
<th>$\mu$ ($\pm$ $\sigma$) [ps], non-unif. FF</th>
</tr>
</thead>
<tbody>
<tr>
<td>FPU</td>
<td>random, spatially-correlated</td>
<td>18.20 ± 1.77</td>
</tr>
<tr>
<td></td>
<td>spatially-correlated</td>
<td>18.39 ± 1.87</td>
</tr>
<tr>
<td></td>
<td>random</td>
<td>17.36 ± 0.19</td>
</tr>
<tr>
<td></td>
<td>none</td>
<td>17.36</td>
</tr>
<tr>
<td></td>
<td>random, spatially-correlated (tree-only†)</td>
<td>134.12 ± 5.94</td>
</tr>
<tr>
<td>MeP</td>
<td>random, spatially-correlated</td>
<td>16.38 ± 2.19</td>
</tr>
<tr>
<td></td>
<td>spatially-correlated</td>
<td>16.39 ± 2.55</td>
</tr>
<tr>
<td></td>
<td>random</td>
<td>14.31 ± 0.22</td>
</tr>
<tr>
<td></td>
<td>none</td>
<td>14.31</td>
</tr>
<tr>
<td></td>
<td>random, spatially-correlated (tree-only†)</td>
<td>34.72 ± 4.05</td>
</tr>
</tbody>
</table>
We first analyze the clock skew while changing the mesh pitch from 30 μm to 500 μm. When changing the mesh pitch, the wire widths of the tree and the mesh are adjusted so that the total wiring area excluding the mesh-FF wires is unchanged. Although the number of branches to the final drivers in the tree varies according to the mesh pitch, the number of driver stages in the tree is fixed so as not to change the clock latency drastically.
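The constant-area rule can be illustrated with a simplified calculation that counts only the mesh wires. This is an assumption made for illustration: the paper also includes the tree wires in the area budget, and the helper names and the inclusion of the boundary lines of the 1 mm square are hypothetical choices. Starting from the 0.64 μm width at 100 μm pitch given in Sect. 3.1, the sketch scales the width inversely with the total mesh wire length.

```python
def mesh_wire_length_um(pitch_um, size_um=1000.0):
    """Total wire length of a square mesh, counting the boundary lines."""
    lines_per_direction = round(size_um / pitch_um) + 1
    # Horizontal and vertical lines, each spanning the full domain.
    return 2 * lines_per_direction * size_um


def scaled_mesh_width_um(pitch_um, ref_pitch_um=100.0, ref_width_um=0.64,
                         size_um=1000.0):
    """Wire width that keeps the mesh wiring area equal to the reference."""
    ref_area = ref_width_um * mesh_wire_length_um(ref_pitch_um, size_um)
    return ref_area / mesh_wire_length_um(pitch_um, size_um)


if __name__ == "__main__":
    for pitch in (30.0, 50.0, 100.0, 200.0, 500.0):
        print(pitch, round(scaled_mesh_width_um(pitch), 3))
```

Under this mesh-only simplification the width at 50 μm pitch comes out at roughly 0.335 μm, close to the 0.33 μm that the paper uses for that pitch.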

Figure 13 shows the relation between the clock skew and the mesh pitch when the random and spatially-correlated $V_{th}$ variabilities and non-uniform FF placement of MeP are given and $\lambda$ is 5. The error bar indicates the standard deviation. As the mesh pitch becomes 150 μm and above, the skew increases rapidly, because longer mesh shunt diminishes the averaging effect of clock skew reduction. On the other hand, when the mesh pitch is 150 μm to 30 μm, the clock skew is almost the same.

We found that the reason is that the resistance between two nodes increases when the mesh becomes finer while the total wiring area is kept unchanged. To illustrate this, we compute the resistance between the corner node (In) and a node on the diagonal line (Out), as shown in Fig. 14. The wire widths for the 100 μm and 50 μm mesh pitches are 0.64 μm and 0.33 μm, respectively. Figure 15 shows the resistance between In and Out, where the horizontal axis is the Manhattan distance between In and Out. Figure 16 shows the resistance increase of the 50 μm mesh pitch relative to the 100 μm mesh pitch. The resistance with the 50 μm mesh pitch is larger, and the rate of increase is especially high at short distances.

We explain the reason using Fig. 17. The mesh pitches in Figs. 17(a), (b) are 100 μm and 50 μm, respectively. We compare the resistances between A and B ($R_{AB}$) and between A’ and B’ ($R_{A'B'}$), which have the same physical node-to-node distance. We here assume that the resistive mesh extends to infinity. The resistance of each segment in the 50 μm-pitch mesh is the same as that in the 100 μm-pitch mesh, because both the length and width are half. Therefore, in the electrical circuit model, the square including A and B as vertices in Fig. 17(a) corresponds to the square whose vertices are A’ and C’ in Fig. 17(b). Thus, $R_{AB}$ and $R_{A'C'}$ are identical. In contrast, resistance $R_{A'B'}$ is larger than $R_{A'C'}$, since B’ is more distant than C’ from A’. Hence, $R_{A'B'}$ is larger than $R_{AB}$.
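The argument can be made quantitative with the known two-point resistances of an infinite square resistor lattice. The worked values below are an illustration under an assumption that Fig. 17 does not confirm here: A and B are taken to be adjacent nodes of the 100 μm mesh, so that A' and B' are two segments apart along one axis of the 50 μm mesh. The symbol $r$ (resistance of one segment, equal in both meshes) and the notation $R(m,n)$ for the resistance between nodes separated by $m$ and $n$ segments are introduced only for this example.

```latex
% Known lattice values: adjacent nodes and diagonal neighbors.
R(1,0) = \tfrac{1}{2}\,r, \qquad R(1,1) = \tfrac{2}{\pi}\,r
% Lattice relation evaluated at node (1,0), with R(0,0) = 0:
4R(1,0) = R(0,0) + R(2,0) + 2R(1,1)
\;\Longrightarrow\;
R(2,0) = \left(2 - \tfrac{4}{\pi}\right) r \approx 0.73\,r
% Comparison for the same physical distance of 100 um:
R_{AB} = R_{A'C'} = R(1,0) = 0.5\,r, \qquad
R_{A'B'} = R(2,0) \approx 0.73\,r, \qquad
\frac{R_{A'B'}}{R_{AB}} = 4 - \frac{8}{\pi} \approx 1.45
```

Under these assumptions, halving the pitch at constant wiring area raises the resistance between two nodes 100 μm apart by roughly 45%, consistent with the direction of the inequality derived in the text.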

Therefore, the finer mesh increases the resistance between neighboring nodes, i.e., the finer mesh reduces the averaging effect of mesh shunts, which results in an increase in clock skew. We might intuitively think that a finer mesh gives better clock skew reduction, but this is not true. A very fine mesh is not necessarily effective for clock skew reduction.

Second, we evaluate the clock skew while varying the effective fanout $\lambda$ from 3 to 20. The experimental condition is the same as before, except that the mesh pitch is fixed at 100 $\mu$m. Figure 18 shows the clock skew and the dynamic energy per clock cycle. A larger fanout increases the clock skew but reduces the dynamic energy. Figure 19 shows the latency from the source to the FFs. When the fanout varies from 3 to 20, the latency increases by over 300%, whereas the clock skew increases by only 50%. We think that the skew increase caused by weakening the driver strength is limited and that the robustness does not deteriorate much, thanks to the mesh shunts.

Compared to a tree-only distribution, the mesh-style distribution achieves small clock skew, so the clock energy can potentially be reduced by increasing the fanout as long as the target skew is satisfied. For example, when the mean + sigma clock skew is required to be below 20 ps, we can choose an effective fanout $\lambda$ of 7, which results in a 60% dynamic energy reduction compared to $\lambda = 3$.

4. Conclusion

We evaluated clock skew considering manufacturing variability of $V_{th}$ and FF placement in mesh-style clock distribution in a 90 nm technology. The spatially-correlated variation has more impact on clock skew than random variation. Non-uniformity of FF placement can be a large source of clock skew. We revealed that the finer mesh is not necessarily effective for clock skew reduction, because of resistance increase between neighboring nodes. We also discussed the possibility that clock mesh is suitable for low energy clock distribution by controlling the driver fanout.

Acknowledgments

This work is supported in part by NEDO.

References


Shinya Abe received the B.E. degree in information systems engineering from Osaka University in 2007. He is currently pursuing the M.E. degree in the Department of Information Systems Engineering at Osaka University. His major interest is mesh-style clock distribution. He is a student member of IEEE.
Masanori Hashimoto received the B.E., M.E. and Ph.D. degrees in Communications and Computer Engineering from Kyoto University, Kyoto, Japan, in 1997, 1999, and 2001, respectively. Since 2004, he has been an Associate Professor in Department of Information Systems Engineering, Graduate School of Information Science and Technology, Osaka University. His research interest includes computer-aided design for digital integrated circuits, and high-speed circuit design. Dr. Hashimoto served on the technical program committees for international conferences including DAC, ICCAD, ASP-DAC, ICCD and ISQED. He is a member of IEEE and IPSJ.

Takao Onoye received B.E. and M.E. degrees in Electronic Engineering, and Dr.Eng. degree in Information Systems Engineering all from Osaka University, Japan, in 1991, 1993, and 1997, respectively. He is currently a professor in the Department of Information Systems Engineering, Osaka University. His research interests include media-centric low-power architecture and its SoC implementation. He is a member of IEEE, IPSJ, and ITE-J.