# Experimental Study of Proton-Induced Radiation Effects on DDR5 Modules

Yang Li<sup>®</sup>, Masakazu Yoshida<sup>®</sup>, Yuibi Gomi<sup>®</sup>, Yifan Deng<sup>®</sup>, Yukinobu Watanabe<sup>®</sup>, Satoshi Adachi, Masatoshi Itoh, Guohe Zhang<sup>®</sup>, *Member, IEEE*, Chaohui He, and Masanori Hashimoto<sup>®</sup>, *Senior Member, IEEE* 

Abstract-Double data rate 5 synchronous dynamic random access memory (DDR5 SDRAM), as the latest generation in its family, is an outstanding candidate for future space applications, highlighting the importance of considering its radiation performance. In this article, we investigated the proton-induced radiation effects on DDR5 dual-inline-memory-modules (DIMMs) for the first time. Consumer-grade DDR5 modules were tested, taking into account several factors, including proton energy, module vendors, and the specific power management unit (PMU) on DDR5. The results provided the single-event effect (SEE) cross section (CS) curve as a function of proton energy and uncovered the sensitivity of different vendors and the PMU. In addition, comparison tests between server-grade DDR4 and DDR5 modules were conducted to study the impacts of different generations, external error correction code (ECC) cases, and accumulated effects. Fault injection simulations were also conducted to identify potential causes for the observed patterns in the experiments with the existence of on-die ECC.

*Index Terms*—Accumulated radiation effect, double data rate 5 (DDR5), dynamic random access memory (DRAM), on-die error correction code (ECC), proton, single event effect (SEE), soft error.

## I. INTRODUCTION

**D**OUBLE data rate 5 (DDR5) represents the latest generation in its family. Compared to its predecessor, DDR4, the structure of DDR5 has undergone significant changes [1]. For instance, DDR5 chips adopt a lower operating voltage and introduce an on-die error correction code (ECC) to ensure data correctness. In addition, DDR5 dual-inline-memory-modules (DIMMs) incorporate a discrete power management unit (PMU) for enhanced power efficiency [2]. Owing to these enhancements, DDR5 modules offer multiple advantages, such as substantially higher speeds, increased capacity, and improved performance, among others. It has already found widespread application in consumer electronics and holds great potential for use in aerospace electronics, including satellites and 5G telecommunications, in the future [3].

Received 10 March 2025; revised 23 April 2025; accepted 24 April 2025. Date of publication 28 April 2025; date of current version 18 June 2025. This work was supported in part by Japan Science and Technology Agency (JST), Core Research for Evolutional Science and Technology (CREST) under Grant JPMJCR19K5, in part by the Grant-in-Aid for Scientific Research (S) from Japan Society for the Promotion of Science (JSPS) under Grant 19H05664 and Grant 24H00073, and in part by the China Scholarship Council under Grant 202206280203. (*Corresponding author: Masanori Hashimoto.*)

Yang Li was with the Department of Informatics, Kyoto University, Kyoto 606-8501, Japan. He is now with the School of Microelectronics, Xi'an Jiaotong University, Xi'an 710049, China (e-mail: yang.li@xjtu.edu.cn).

Masakazu Yoshida, Yuibi Gomi, and Masanori Hashimoto are with the Department of Informatics, Kyoto University, Kyoto 606-8501, Japan (e-mail: hashimoto@i.kyoto-u.ac.jp).

Yifan Deng was with the Interdisciplinary Graduate School of Engineering Sciences, Kyushu University, Fukuoka 816-8580, Japan. He is now with the Spallation Neutron Source Science Center, Dongguan 523803, China, and also with the Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China (e-mail: dengyf@ihep.ac.cn).

Yukinobu Watanabe is with the Faculty of Engineering Sciences, Kyushu University, Fukuoka 816-8580, Japan (e-mail: watanabe@aees.kyushu-u.ac.jp).

Satoshi Adachi and Masatoshi Itoh are with the Cyclotron and Radioisotope Center, Tohoku University, Sendai 980-8576, Japan.

Guohe Zhang is with the School of Microelectronics, Xi'an Jiaotong University, Xi'an 710049, China (e-mail: zhangguohe@xjtu.edu.cn).

Chaohui He is with the School of Nuclear Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China (e-mail: hechaohui@xjtu.edu.cn).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNS.2025.3565125.

Digital Object Identifier 10.1109/TNS.2025.3565125

In the context of aerospace electronics, protons represent a significant radiation threat to DDR5, as they are a primary component of cosmic rays [4]. For modern integrated circuits (ICs), both high- and low-energy protons can cause single-event upsets (SEUs) [5], [6]. Moreover, protons can lead to cumulative damage over long-term exposure, such as total ionizing dose (TID) and displacement damage (DD) effects [7], [8]. In fact, it has been demonstrated that previous generations, such as DDR2, DDR3, and DDR4, are sensitive to proton exposure, indicating that synchronous dynamic random access memories (SDRAMs) are vulnerable to such radiation [9], [10]. The variety of proton-induced SEUs in SDRAMs includes single-bit upset (SBU) or multiple-bit upset (MBU), row/column burst clusters, and single-event functional interrupts (SEFIs) [11]. Accumulated effects can lead to stuck bits or reduced data retention time in memory cells [12], [13]. Consequently, DDR5 is expected to encounter similar challenges from proton exposure.

More importantly, the innovative features of DDR5 modules bring about significant uncertainty with respect to their radiation effects. Specifically, the consequences of integrating on-die ECC and incorporating discrete on-board PMU, as well as the fundamental differences in radiation response between DDR4 and DDR5, remain unclear. This knowledge is essential for application designers. Although the general reliability of DDR5 has been explored, as noted in [1], reports on its susceptibility to radiation exposure are lacking. Thus, conducting research on the radiation effects of DDR5 is imperative for its prospective applications.

This article specifically addresses the proton-induced radiation effects on DDR5 modules, encompassing single-event effects (SEEs) and cumulative damage, primarily through irradiation experiments. Our experiments are designed to yield comprehensive results by considering several factors: 1) proton energy levels; 2) vendors of consumer-grade modules; 3) the PMU chip of consumer-grade modules; 4) the presence of external ECC in server-grade DDR4 and DDR5 modules; and 5) their cumulative effects. For server-grade modules, DDR4 is equipped with external sideband ECC, whereas DDR5 benefits from both external ECC and on-die ECC. We present and discuss the observed cross sections (CSs) and failure characteristics. Furthermore, to elucidate the mechanisms behind the error patterns observed in consumer-grade modules, we conduct fault injection simulations and offer speculative insights.

0018-9499 © 2025 IEEE. All rights reserved, including rights for text and data mining, and training of artificial intelligence and similar technologies. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

This article is organized as follows. Section II details the experimental setup used for proton irradiation studies. Section III presents and elaborates on the experimental findings. Fault injection simulations related to the on-die ECC, along with their analysis, are discussed in Section V. Sections VI and VII provide additional analysis and conclude this article, respectively.

## II. EXPERIMENTAL SETUP

## A. Devices Under Test

Devices under test (DUTs) are DDR modules. Table I provides a summary of the DUT information. The experiments involve consumer-grade DDR5 modules from two vendors: KINGSTON (DIMM No. KVR48U40BSB-16, labeled as CG-DDR5-K) and ADATA (DIMM No. PC5-38400, labeled as CG-DDR5-A), with their bare chips sourced from Hynix and Micron, respectively (see the part numbers in Table I). In the experiments, CG-DDR5-K undergoes a proton energy sweep up to a maximum energy of 75.88 MeV, including independent irradiation of its PMU. Comparisons between the two vendors focus on proton energies of 60.36 and 75.88 MeV. Both consumer-grade modules are equipped solely with an inherent on-die ECC.

In addition, a server-grade DDR5 module (KSM48E40BS8KM-16HM, labeled as SG-DDR5-K), featuring both on-die ECC and external ECC, is utilized in the experiments. This setup allows for an analysis of the contributions from both types of ECC. Moreover, a server-grade DDR4 module (KSM32N22S8/16), equipped solely with external ECC and labeled SG-DDR4-K, is selected for comparison with the server-grade DDR5 module. Both SG-DDR5-K and SG-DDR4-K modules employ Hynix technology, facilitating a fair comparison of their radiation performance.

All DUTs, tested with a standard power supply voltage of 1.1 V at room temperature ( $\approx 25$  °C), feature the same capacity of 16 GB. The consumer-grade modules are equipped with eight chips, whereas server-grade modules include an additional one (DDR4) or two (DDR5) chips dedicated to storing external ECC check bits. All chips are mounted on one side of the modules. Four distinct motherboards were selected to accommodate CG-DDR5-K, CG-DDR5-A, SG-DDR5-K, and SG-DDR4-K, as listed in Table I. Notably, the external ECC for SG-DDR5-K and SG-DDR4-K can be toggled ON or OFF through the basic input output system (BIOS) configurations of their respective motherboards. In contrast, the on-die ECC feature of all DDR5 modules operates inherently.

## B. Test Program

MemTest86, used for experiments, is a standalone memory testing software designed for x86 and ARM computers [14]. It features 14 sub-tests, ranging from Test 0 to Test 13, to meet diverse testing needs. Its free version is integrated into many hardware BIOS systems for quick memory diagnostics. Beyond its industrial applications, MemTest86 is also utilized within the academic community, notably in studies testing the radiation effects on DDR memories, such as SEEs [15]. The site version of MemTest86 with some motherboards uniquely offers capabilities for decoding and identifying the specific chips where errors occur. In our DUT setup, the MemTest86 Site Version (10.2) can detect SEUs and pinpoint the relevant addresses.

1) Test for SEEs: Test 5 (moving inversions and random pattern) is employed for SEE testing. The procedure of the moving inversion test is shown in Table II.

Detected errors are recorded in log files, which include the upset bits within the data, their corresponding memory addresses, and the specific DIMM or chip implicated. Utilizing these data, the SEEs can be thoroughly analyzed. In addition, the test system was configured to restart after every two testing rounds to eliminate any potential cumulative effects.
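The moving inversions procedure of Table II can be illustrated with a small software model. The following is a minimal sketch, assuming a `bytearray` stands in for the memory under test and a hypothetical `inject_fault` hook stands in for a radiation-induced upset; the function names and the single-round structure are illustrative and do not reproduce the MemTest86 implementation.

```python
import random


def moving_inversions_round(memory, inject_fault=None, seed=0):
    """Run one simplified moving-inversions round on a bytearray.

    Returns a list of (address, expected, actual) tuples for every mismatch.
    """
    rng = random.Random(seed)
    size = len(memory)

    # Phase 1: fill the whole memory with a random pattern.
    pattern = [rng.randrange(256) for _ in range(size)]
    for addr in range(size):
        memory[addr] = pattern[addr]

    # Hypothetical hook: emulate an upset that happens after the fill.
    if inject_fault is not None:
        inject_fault(memory)

    errors = []

    # Phase 2: forward pass, verify each word and write its inverse.
    for addr in range(size):
        if memory[addr] != pattern[addr]:
            errors.append((addr, pattern[addr], memory[addr]))
        memory[addr] = pattern[addr] ^ 0xFF

    # Phase 3: reverse pass, verify the inverse and invert again
    # (which restores the original pattern).
    for addr in range(size - 1, -1, -1):
        expected = pattern[addr] ^ 0xFF
        if memory[addr] != expected:
            errors.append((addr, expected, memory[addr]))
        memory[addr] = pattern[addr]

    return errors


def flip_bit(addr, bit):
    """Build a fault hook that flips one bit at one address."""
    def fault(memory):
        memory[addr] ^= 1 << bit
    return fault


if __name__ == "__main__":
    mem = bytearray(64)
    for addr, expected, actual in moving_inversions_round(mem, flip_bit(10, 3)):
        print(f"addr={addr} expected={expected:#04x} actual={actual:#04x}")
```

In this model, an upset is reported once, in the pass that first reads the affected word, after which the word is overwritten. This mirrors why a log entry carries the address and the differing bits.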

2) Test for Accumulated Effects: Test 5 (moving inversions, random pattern) and Test 13 (hammer test) are utilized to assess stuck bits and weak bits, respectively, caused by accumulated irradiation. Proton-induced accumulated effects can lead to defects in semiconductor devices, particularly at the Si-SiO<sub>2</sub> interface, causing charge leakage [16]. Due to variations in manufacturing processes, dynamic random access memory (DRAM) cells exhibit different levels of susceptibility to these accumulated effects. In severe cases of charge leakage, stuck bits may emerge, typically presenting as fixed patterns [17]. These stuck bits, which lose their ability to toggle between logic states and remain fixed at either 0 or 1, can be readily identified by Test 5.

In cases of mild charge leakage, the retention time of some DRAM cells decreases significantly, leading to so-called weak bits. These bits are more susceptible to errors during the hammer test (Test 13). The hammer test involves writing a specific pattern into the DDR modules, followed by repeatedly accessing certain addresses within a brief period. Subsequently, the contents of other addresses in the same memory bank but different rows are read back to check for upset bits [10], [14], [18]. The combined effects of frequent access and accumulated irradiation result in significant charge leakage in adjacent rows, particularly affecting the weak bits and potentially causing upset bits [19]. By employing Test 13, we can investigate the correlation between the number of weak bits and the accumulated proton fluence.

## C. Test Setup and Flow

Fig. 1 depicts the overall test setup. In the irradiation room, the four DUTs are securely mounted on the experimental platform. The testing software and output log files are stored on USB flash drives attached to the motherboards. The power and peripheral cables are connected to power supply units (labeled 1# and 2#) and a DUT selector (for high definition multimedia interface (HDMI) and USB connections). Two LAN cables facilitate communication between the counting room and the irradiation room. One is dedicated to the graphical user interface (GUI) display, while the other serves peripheral devices and remote controls. This configuration allows the four DUTs to conduct remote automatic tests, significantly enhancing experimental efficiency.

TABLE I

DUT INFORMATION

| DUT       | Motherboard & CPU                    | DDR module                                    | ECC               | Main purposes                                                                  |
|-----------|--------------------------------------|-----------------------------------------------|-------------------|--------------------------------------------------------------------------------|
| CG-DDR5-K | ASUS Z790-H, Intel 12th i3-12100     | KINGSTON, KVR48U40BSB-16, Part No. HMCG78     | on-die            | 1) comparison to Micron; 2) dependence on proton energy; 3) effect on PMU      |
| CG-DDR5-A | Gigabyte Z790, Intel 12th i3-12100   | ADATA, PC5-38400, Part No. MT60B2G8           | on-die            | 1) comparison to Hynix                                                         |
| SG-DDR5-K | ASUS B650-PLUS, AMD 7600X            | KINGSTON, KSM48E40BS8KM-16HM, Part No. HMCG78 | on-die & external | 1) comparison between w/ and w/o external ECC; 2) comparison to DDR4 modules   |
| SG-DDR4-K | ASUS X570-PLUS, AMD 5600             | KINGSTON, KSM32N22S8/16, Part No. H5AG48      | external          | 1) comparison of external ECC cases; 2) comparison to DDR5 modules             |

TABLE II

TESTING PROCEDURE FOR TEST 5 (MOVING INVERSIONS) [14]

| # | Phase          | Operation                | Details                                                                                             |
|---|----------------|--------------------------|-----------------------------------------------------------------------------------------------------|
| 1 | Initialization | Pattern writing          | Fill entire memory with random pattern                                                              |
| 2 | Forward pass   | Verification & Inversion | Start from lowest address; Check discrepancies; Write inverse pattern; Move to next higher address  |
| 3 | Reverse pass   | Verification & Inversion | Start from highest address; Check discrepancies; Write inverse pattern; Move to next lower address  |
| 4 | Continuation   | Pattern alteration       | Repeat process with continuous pattern variation                                                    |

Fig. 1. Overall test setup. Four motherboards with DUTs are irradiated in turn. They are moved, configured, powered, and monitored remotely.

Experiments were conducted at the Cyclotron and Radioisotope Center (CYRIC) at Tohoku University, using a 2-cm diameter proton beam with an accelerated energy of 77.50 MeV. After passing through the diffuser, beam window, and room air, the beam energy is reduced, resulting in a 75.88-MeV proton beam at the DUT's surface when no external degraders are used. In addition, copper plates placed in the beamline serve as degraders to fine-tune the proton energy, as listed in Table III. The final proton energy is estimated using GEANT4 simulations, incorporating the specific beamline setup.

## TABLE III

PROTON ENERGY AVAILABLE TO DUTS VERSUS DEGRADER THICKNESS (GEANT4 CALCULATIONS). SPECTRUM WIDTH IS DETERMINED BY HALF-HEIGHT WIDTH OF THE ENERGY DISTRIBUTION

| Copper degrader thickness | Proton energy | Spectrum width |
|---------------------------|---------------|----------------|
| 0.00 mm                   | 75.88 MeV     | 0.33 MeV       |
| 2.70 mm                   | 60.36 MeV     | 1.07 MeV       |
| 5.00 mm                   | 44.13 MeV     | 1.72 MeV       |
| 6.80 mm                   | 27.43 MeV     | 2.76 MeV       |
| 7.00 mm                   | 24.91 MeV     | 2.96 MeV       |
| 7.20 mm                   | 22.21 MeV     | 3.19 MeV       |
| 7.41 mm                   | 19.58 MeV     | 3.56 MeV       |
| 7.51 mm                   | 17.95 MeV     | 3.87 MeV       |
| 7.61 mm                   | 16.39 MeV     | 4.24 MeV       |
| 7.80 mm                   | 13.23 MeV     | 4.98 MeV       |
| 8.20 mm                   | 4.42 MeV      | 6.47 MeV       |

Four motherboards are positioned in parallel, with the backside of all DDR modules aligned perpendicularly to the beam direction. To mitigate the impact of scattered protons, motherboards are equipped with one or two extenders produced by M-FACTORS Storage, adding additional height to decrease the effect on the CPUs. Only two chips per module are exposed without any protection, such as paraffin blocks. The neighboring chips are safeguarded by 9-mm thick copper shields, the necessary thickness of which is calculated based on stopping and range of ions in matter (SRIM) simulations [20]. The interference from scattered protons and secondary neutrons was not visible, and no upsets were observed in the neighboring DRAM chips in experiments. Notably, during PMU experiments, memory chips adjacent to the PMU chip are fully shielded.

The test flow consists of five parts, as follows.

1) Proton Energy Sweep, CG-DDR5-K, Test 5: The proton energy ranged from 4.42 to 75.88 MeV. Due to the small CS ( $\sigma$ ) of DRAM chips [21], [22], the total valid fluence at each energy point was set to exceed 10<sup>11</sup> p/cm<sup>2</sup>. To mitigate the interference from accumulated effects, each module underwent retesting after exposure to each proton energy level. A module was to be replaced if the retest detected any errors, but no such replacement was needed.

2) Vendor Comparison, CG-DDR5-K and -A, Test 5: Two energy points were chosen: 60.36 and 75.88 MeV.

3) PMU, CG-DDR5-K, Test 5: Only PMU was under irradiation at 75.88 MeV.

4) ECC for SG-DDR5-K and SG-DDR4-K, Test 5: Both the enabled and disabled ECC cases were tested at 75.88 MeV.

## III. EXPERIMENTAL RESULTS FOR CONSUMER-GRADE DDR5 MODULES

## A. Proton Energy Sweep for Consumer-Grade DDR5

Fig. 2 depicts the SEE CS ( $\sigma_{\text{SEE}}$ ) curves for CG-DDR5-K as a function of the proton energy used for irradiation. Sys\_SEFIs are classified as either a system hang or an automatic system reboot, which signifies unrecoverable errors. DRAM\_SEEs consist mostly of DRAM\_SEFIs, with only a small number of DRAM\_SEUs. DRAM\_SEFIs manifest as numerous consecutive upsets (CUs) over time, including continuous SBUs or MBUs. In some CUs, the upset addresses occur at regular intervals, whereas in others, no such pattern is observed. In addition, CUs are only observed in a single round of testing. DRAM\_SEFIs may persist for several seconds before recovery, sometimes until a sys\_SEFI occurs. The small number of DRAM\_SEUs manifest as SBUs and double cell upsets (DCUs). A DCU consists of two SBUs observed in cells whose logical addresses differ by 8, 16, or 24. The  $\sigma$  of sys\_SEFI and DRAM\_SEE ( $\sigma_{\text{sys\_SEFI}}$  and  $\sigma_{\text{DRAM\_SEE}}$ , per chip) are calculated using their valid fluence ( $\phi_{\text{sys\_SEFI}}$  and  $\phi_{\text{DRAM\_SEE}}$ , p/cm<sup>2</sup>) and the event number ( $N_{\text{sys\_SEFI}}$  and  $N_{\text{DRAM\_SEE}}$ ) of the two chips per module during irradiation. Because loading the test programs takes about 10 s (a deadtime that applies only to DRAM\_SEE tests), the valid fluence for calculating  $\sigma_{\text{sys\_SEFI}}$  is always larger than that for  $\sigma_{\text{DRAM\_SEE}}$ , but both exceed  $1 \times 10^{11}$  p/cm<sup>2</sup>. It is evident that both  $\sigma_{\text{sys\_SEFI}}$  [see Fig. 2(a)] and  $\sigma_{\text{DRAM\_SEE}}$  [see Fig. 2(b)] exhibit peaks at similar proton energies near 25 MeV. The total  $\sigma$  ( $\sigma_{\text{total}} = \sigma_{\text{sys\_SEFI}} + \sigma_{\text{DRAM\_SEE}}$ ) displays a more pronounced peak, as illustrated in Fig. 2(c).
The error bars, corresponding to  $1\sigma$  uncertainty, are calculated using the following equations, in which the factor 0.5 arises because two chips were irradiated:

$$\sigma_{\text{error, sys\_SEFI}} = \frac{\sqrt{0.5 * N_{\text{sys\_SEFI}}}}{\phi_{\text{sys\_SEFI}}} \tag{1}$$

$$\sigma_{\text{error, DRAM\_SEE}} = \frac{\sqrt{0.5 * N_{\text{DRAM\_SEE}}}}{\phi_{\text{DRAM\_SEE}}} \tag{2}$$

$$\sigma_{\text{error, total}} = \sqrt{\sigma_{\text{error, sys\_SEFI}}^2 + \sigma_{\text{error, DRAM\_SEE}}^2} \tag{3}$$
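As a worked illustration of (1)-(3), the sketch below computes per-chip cross sections and their error bars from event counts and valid fluences. The per-chip cross section is assumed here to be $N / (n_{\text{chips}} \cdot \phi)$ with two irradiated chips, which is one reading of the description above. The function names and the event counts and fluences in the usage example are hypothetical and are not measured values from this work.

```python
import math


def cross_section_per_chip(n_events, fluence, n_chips=2):
    """Per-chip cross section in cm^2 (assumed form: N / (n_chips * fluence))."""
    return n_events / (n_chips * fluence)


def error_per_chip(n_events, fluence, n_chips=2):
    """1-sigma error bar following Eq. (1)/(2): sqrt(0.5 * N) / fluence for two chips."""
    return math.sqrt(n_events / n_chips) / fluence


def total_error(err_sys_sefi, err_dram_see):
    """Quadrature sum of the two error bars, following Eq. (3)."""
    return math.hypot(err_sys_sefi, err_dram_see)


if __name__ == "__main__":
    # Hypothetical inputs for illustration only (not data from the experiments).
    n_sys, phi_sys = 8, 2.0e11      # events, p/cm^2
    n_dram, phi_dram = 6, 1.5e11    # events, p/cm^2

    sigma_sys = cross_section_per_chip(n_sys, phi_sys)
    sigma_dram = cross_section_per_chip(n_dram, phi_dram)
    err_sys = error_per_chip(n_sys, phi_sys)
    err_dram = error_per_chip(n_dram, phi_dram)

    print(f"sigma_sys_SEFI = {sigma_sys:.2e} +/- {err_sys:.2e} cm^2/chip")
    print(f"sigma_DRAM_SEE = {sigma_dram:.2e} +/- {err_dram:.2e} cm^2/chip")
    print(f"sigma_total    = {sigma_sys + sigma_dram:.2e} "
          f"+/- {total_error(err_sys, err_dram):.2e} cm^2/chip")
```

With event counts in the single digits, the error bar is comparable to the cross section itself, which is why the valid fluence at each energy point matters for resolving the peak.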

As reported in the literature, these peaks result from direct ionization by the low-energy protons [5]. The proton energy at the  $\sigma$  peak observed in experiments significantly exceeds the proton's Bragg peak energy in silicon (approximately 55 keV, according to SRIM simulation [20]). This offset is attributed to backside irradiation and the thickness of the printed circuit board (PCB) structure, as depicted in Fig. 3, through which protons lose energy before reaching the sensitive volume. Furthermore, the proton spectrum is substantially broadened after passing through the degrader (see Table III) and the PCB structure. Consequently, the effective proton flux near the Bragg peak energy within the sensitive volume is considerably diminished. Therefore,  $\sigma$  at the peak does not exceed  $\sigma$  at the other energy points by orders of magnitude. The relatively high  $\sigma$  near the peak is also influenced by the broadened spectrum.



Fig. 2.  $\sigma_{SEE}$  versus proton energy for consumer-grade DDR5 modules (CG-DDR5-K). (a)  $\sigma_{sys\_SEFI}$  per chip versus proton energy. (b)  $\sigma_{DRAM\_SEE}$  per chip versus proton energy. (c)  $\sigma_{total}$  per chip versus proton energy.

Notably, the  $\sigma_{\text{DRAM\_SEE}}$  curve [see Fig. 2(b)] exhibits a secondary peak at an energy of 19.58 MeV. This phenomenon is likely attributable to inhomogeneities between the chips and the PCB. As depicted in Fig. 3, DDR modules employ flip-chip packaging technology, characterized by an array of solder bumps interspersed with periodic air gaps between the chips and PCB. The solder bumps, approximately 220  $\mu$m thick and denser than air, induce an inhomogeneous region that splits incident protons into two distinct energy spectra. Specifically, for protons with the same initial energy, the ones passing through the air gaps tend to have higher average energy inside the DRAM chips than those passing through the solder bumps. Consequently, the Bragg peak energy is reached within the sensitive volume at two distinct incident proton energies. The secondary peak emerges from the incident lower-energy protons that pass through the air gaps.

Fig. 3. Cut-off section of CG-DDR5-K.

To elucidate the cause of the dual peaks observed in  $\sigma_{\text{DRAM\_SEE}}$ , we performed a simulation analysis using Geant4 [23]. The geometric model for simulation, depicted in Fig. 4, utilizes the rectangular parallelepiped (RPP) approximation. In this model, a DRAM\_SEE is assumed to be triggered when an incident proton or its secondary particles deposit energy exceeding the threshold energy  $(E_{\rm th})$  in a sensitive volume (SV). To make the simulation result closely match the observed  $\sigma$  ratio between the peak attributable to direct ionization and the tail resulting from elastic and inelastic interactions, the  $E_{\rm th}$  value was set to 50 keV, which is equivalent to a critical charge  $(Q_c)$  of 2.2 fC. The SV depth was set to 0.4  $\mu$m, as reported in [24]. When  $Q_c$  is increased during simulation, the  $\sigma$  peak decreases significantly because it becomes more difficult for the energy deposited by direct ionization of a proton to exceed the DRAM\_SEE threshold. However, the tail of  $\sigma$  is relatively insensitive to the change in  $Q_c$  until it exceeds the value equivalent to the large amount of energy deposited by elastic and inelastic interactions. As a result, the modest peak-to-tail ratio shown in Fig. 2 suggests that  $Q_c$  of the DDR5 should also be at a relatively high value, such as 2.2 fC. For the characterization of  $\sigma_{\text{DRAM\_SEE}}$ , we organized a matrix of 2-Gb SVs, each with an area of  $80 \times 80$  nm, spaced 167 nm apart within the sensitive layer, to ensure the generation of a sufficient number of DRAM\_SEE events. The simulated  $\sigma$  was normalized by applying a scaling factor k such that the magnitude of the simulated peak reproduces the experimental one.
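The stated equivalence between $E_{\rm th} = 50$ keV and $Q_c = 2.2$ fC can be checked with a short calculation. The sketch below assumes the commonly used mean energy of 3.6 eV per electron-hole pair in silicon. That value is not stated in this article and is an assumption here, as are the function and constant names.

```python
# Assumed mean energy to create one electron-hole pair in silicon (eV).
EV_PER_PAIR_SI = 3.6
# Elementary charge in coulombs (exact SI value).
ELEMENTARY_CHARGE_C = 1.602176634e-19


def deposited_energy_to_charge_fc(energy_kev):
    """Convert deposited energy in silicon (keV) to collected charge (fC)."""
    pairs = energy_kev * 1.0e3 / EV_PER_PAIR_SI
    return pairs * ELEMENTARY_CHARGE_C * 1.0e15


if __name__ == "__main__":
    q_c = deposited_energy_to_charge_fc(50.0)
    print(f"E_th = 50 keV corresponds to Q_c of about {q_c:.1f} fC")
```

Under this assumption, the conversion factor is roughly 22.5 keV per fC, which reproduces the 2.2 fC quoted above.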

Fig. 4. Geometric model structure used in Geant4 simulation.

Fig. 5.  $\sigma_{\text{DRAM\_SEE}}$  derived from Geant4 simulations using the RPP approximation is compared with measurement data in Fig. 2(b). The green solid line represents simulations using solder as the bump layer material. The yellow curve indicates the contribution from inelastic interactions, while the purple curve refers to elastic interactions. The gray dotted line shows the total from both elastic and inelastic interactions, and the blue dashed line corresponds to simulations where the bump layer material is set to air.

In the simulation, we initially designated the material of the bump layer as solder, with the simulation outcome illustrated by the green solid line in Fig. 5. Compared to the measurement data, indicated by red dots in Fig. 5, the direct ionization peak from the simulation ( $E_{p-sol}$ ) for solder material aligns with the right experimental peak ( $E_{p2}$ ). The magnitudes of both the direct ionization peak and the tail from elastic and inelastic interactions closely match the observed measurements. Here, the measured tail from 44.13 MeV is consistent with a Weibull function as follows:

$$\sigma_{\text{SEE}}(E_p) = \sigma_{\infty} \left( 1 - \exp\left[-\left(\frac{E_p - E_0}{W}\right)^S\right] \right) \tag{4}$$

where  $\sigma_{\infty} = 2.9 \times 10^{-11}$  cm<sup>2</sup>/chip,  $E_0 = 20$  MeV, W = 35 MeV, and S = 1.4. Subsequently, altering the bump layer material to air changes the simulation result, depicted by the blue dashed curve in Fig. 5. The simulated  $\sigma$  peak ( $E_{p-air}$ ) with the bump layer material set to air aligns with the left measured  $\sigma_{\text{DRAM\_SEE}}$  peak ( $E_{p1}$ ). This higher peak results from the reduced energy straggling of air for protons compared to solder.
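The Weibull description of the high-energy tail can be evaluated numerically. The sketch below uses the four parameters quoted for (4) and assumes the standard four-parameter Weibull form, with the exponent $S$ applied to $(E_p - E_0)/W$ inside the exponential. Returning zero at or below the onset energy $E_0$ is a common convention adopted here as an assumption, and the function name is illustrative.

```python
import math

# Parameters quoted in the text for the tail fit.
SIGMA_INF = 2.9e-11   # saturation cross section, cm^2/chip
E0_MEV = 20.0         # onset energy, MeV
W_MEV = 35.0          # width parameter, MeV
S_SHAPE = 1.4         # shape parameter


def weibull_cross_section(energy_mev):
    """Weibull cross section (cm^2/chip) for the elastic/inelastic tail."""
    if energy_mev <= E0_MEV:
        # Assumed convention: no contribution at or below the onset energy.
        return 0.0
    x = (energy_mev - E0_MEV) / W_MEV
    return SIGMA_INF * (1.0 - math.exp(-(x ** S_SHAPE)))


if __name__ == "__main__":
    for energy in (44.13, 60.36, 75.88):
        sigma = weibull_cross_section(energy)
        print(f"E_p = {energy:6.2f} MeV -> sigma = {sigma:.2e} cm^2/chip")
```

Note that this function describes only the tail from elastic and inelastic interactions. It does not model the direct ionization peaks near 20 to 25 MeV.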

## B. Vendor Comparison

Fig. 6 presents the results for CG-DDR5-A and CG-DDR5-K at two specific energy levels. As mentioned, both are consumer-grade modules with identical capacities. The findings suggest that the sensitivity of modules from both vendors is roughly equivalent. For each energy level, the error bars for  $\sigma_{\text{sys\_SEFI}}$  and  $\sigma_{\text{DRAM\_SEE}}$  of both vendors show overlapping regions, with some reaching as much as 80% overlap (e.g.,  $\sigma_{\text{sys\_SEFI}}$  at 60.36 MeV). In addition,  $\sigma_{\text{total}}$  is comparable for both vendors. The observed similarity in  $\sigma$  across different vendors is likely attributable to their chip structures. An analysis of test data from Tech Insights Inc. has shown that typical Micron and Hynix chips (DDR5-4800 MHz) possess comparable die-chip and DRAM cell dimensions [25]. While there may be minor sensitivity discrepancies between them, the inherently low  $\sigma_{\text{SEE}}$  of consumer-grade DRAM chips makes it difficult to distinguish these differences in our experiments.

Fig. 6.  $\sigma_{\text{SEE}}$  of different vendors at 60.36 and 75.88 MeV (CG-DDR5-K and CG-DDR5-A). Note that error bars are determined by (1)–(3).

## C. SEEs of PMU

Fig. 7 displays the outcomes of PMU irradiation experiments for CG-DDR5-K. To account for sample variation, two modules were subjected to proton irradiation at an energy level of 75.88 MeV. The results demonstrate consistent outcomes between the two modules, in terms of both the types of observed errors and their corresponding  $\sigma$ . Notably, both modules exclusively triggered system SEFIs without any memory upsets. This phenomenon can be attributed to the global impact of SEEs on the PMU within DDR5 modules. The PMU is included to reduce unnecessary power loss and improve overall system performance. It generates multiple voltage rails, such as VDD, VDDQ, and VPP, which are essential for powering the memory chips and related components. In addition, the PMU monitors and regulates the current consumption of each rail to ensure optimal power delivery and efficiency. When SEEs affect PMUs, they cause widespread interference, impacting not just the memory cells but also other circuits such as core decoders and peripheral systems. Consequently, when such circuits are impacted, the system is more likely to encounter a critical SEFI, precipitating the cessation of the system test.

The typical system SEFI induced by the PMU during the experiments, and the most common one, is a case in which the system reboots automatically. Notably, a distinct category of critical system SEFIs emerges after prolonged irradiation, marked by the test system's inability to reboot without cycling the power supply and reconfiguring the BIOS. This severe form of SEFI was observed three times across the two modules.



Fig. 7.  $\sigma_{\text{SEE}}$  of PMU for CG-DDR5-K modules. Two modules were tested at 75.88 MeV. Only SEFIs are detected during irradiation.

Comparing Figs. 6 and 7, it becomes clear that  $\sigma_{PMU}$  is lower than  $\sigma_{total}$  for a single DRAM chip (2 GB), and consequently, it is much lower than  $\sigma_{total}$  for all eight chips on the module. This discrepancy can be attributed to the larger die area and higher transistor density of the DRAM chips. However, the occurrence of SEFIs induced by the PMU should not be overlooked, particularly the critical SEFI type that severely impacts the system's functionality. Given that the PMU is a distinctive feature of DDR5 modules [1], its susceptibility to radiation could pose a unique challenge in radiation-prone environments. This underscores the need for continued research and focused attention on this aspect.

## D. Error Patterns

CG-DDR5-A and CG-DDR5-K not only exhibit similar  $\sigma$  but also share error patterns. Both vendors' modules feature a small number of SBUs and DCUs, with a predominance of DRAM\_SEFIs (or CUs). Table IV provides a summary of the DRAM\_SEE error patterns observed in CG-DDR5-A and CG-DDR5-K. According to Table IV, SBUs, DCUs, and CUs represent 4%, 6%, and 90% of the errors, respectively. The CUs comprise continuous SBUs (Type 1, accounting for 32%) and continuous MBUs (Type 2, 42%, and Type 3, 16%). Notably, 65% of the CUs do not recover until an SEFI occurs, with some lasting even more than 10 s. The relationship between DRAM\_SEFIs and sys\_SEFIs should be further investigated and revealed in future work.
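To make the distinction between SBUs and DCUs concrete, the sketch below classifies the upsets of one isolated event by their logical addresses. It uses only the criterion stated in this article (two upsets whose logical addresses differ by 8, 16, or 24). How log entries are grouped into events, the address representation, and the function name are assumptions for illustration, and consecutive upsets (CUs) are not modeled.

```python
# Logical address differences reported for double cell upsets (DCUs).
DCU_DELTAS = {8, 16, 24}


def classify_isolated_event(upset_addresses):
    """Classify one isolated event from the logical addresses of its upset cells.

    Returns "SBU", "DCU", or "other". Grouping of log entries into one
    event is assumed to have been done beforehand.
    """
    addresses = sorted(set(upset_addresses))
    if not addresses:
        raise ValueError("an event needs at least one upset address")
    if len(addresses) == 1:
        return "SBU"
    if len(addresses) == 2 and (addresses[1] - addresses[0]) in DCU_DELTAS:
        return "DCU"
    return "other"


if __name__ == "__main__":
    # Hypothetical addresses for illustration only.
    print(classify_isolated_event([0x1000]))           # one upset cell
    print(classify_isolated_event([0x1000, 0x1010]))   # two cells, difference 16
    print(classify_isolated_event([0x1000, 0x1003]))   # two cells, difference 3
```

A classifier of this kind covers only the isolated events. Separating the three CU types would additionally require the timing and word-level structure of the logged upsets.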

Table IV reveals that DDR5 modules are still susceptible to SEUs despite the presence of on-die ECC. DRAM chips comprise cell arrays, core decoders, and peripheral circuits [1]. It is reasonable to infer that the observed SBUs and DCUs are primarily due to data disturbances within the cell arrays. The logical address difference between the two upset bits of DCUs (e.g., 8, 16, or 24) hints that those DCUs consist of two physically adjacent upsets induced by a single event. This inference is based on the understanding that upsets from decoders or peripheral circuits typically manifest as continuous or burst errors [11]. In addition, there are usually mappings between logical and physical addresses. With the internal on-die ECC mechanism, each 128-bit block of data is

| Error patterns | Errors continue until sys_SEFIs happen? | Pattern description | Logical address features | Percentage (N=51) |
|----------------|------------------------------------------|---------------------|--------------------------|-------------------|
| Single-bit upset (SBU) | No. Only one time | One event consists of a single-bit upset | - | 4% |
| Double-cell upset (DCU) | No. Only one time | One event consists of two single-bit upsets from correlated cells | $\Delta = 8, 16, \text{ or } 24$ | 6% |
| DRAM_SEFI (Continuous Upsets) | Yes (65%), No (35%) | Type 1, Continuous SBUs: One event consists of numerous continuous single-bit upsets from multiple words | no tendency | 32% |
|  |  | Type 2, Continuous MBU-1: One event consists of numerous continuous multiple-bit upsets with random upset bits in a word | $\Delta = 4$ | 42% |
|  |  | Type 3, Continuous MBU-2: One event consists of numerous continuous multiple-bit upsets with fixed upset bits in a word | no tendency | 16% |

TABLE IV DRAM\_SEE ERROR PATTERNS FOR CONSUMER-GRADE DDR5 MODULES (CG-DDR5-A AND CG-DDR5-K)

safeguarded by eight ECC check bits, enabling the correction of any SBU [2], which will be further elaborated in Section V. Therefore, if a proton-induced SEU affects only a single bit within any cell array, it would be corrected by the on-die ECC, rendering it undetectable in our experiments. However, on-die ECC is incapable of fully correcting MBUs within these 128 bits of data. Consequently, it is speculated that the detected SBUs and DCUs might stem from MBUs caused by proton irradiation in the cell arrays.

Furthermore, the number of CU events constitutes the majority of SEUs (90% in Table IV), significantly outnumbering the combined total of SBUs and DCUs. It is plausible that CUs originate from disturbances in decoders or peripheral circuits, given their characteristics of continuity and burstiness, which distinctly differ from the proton-induced upsets observed in cell arrays. Proton-induced events within decoders or peripheral circuits can lead to addressing errors. In scenarios where cell arrays are programmed with homogeneous patterns of all ones or zeros, such addressing errors might be masked. However, CUs are more likely to manifest during tests involving random patterns in the cell arrays [11], like Test 5 used in our experiments. The inherent randomness of these test patterns improves the ability to expose addressing errors, thereby resulting in CUs.
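The masking argument above can be illustrated with a toy model. The sketch below assumes a hypothetical addressing fault that silently redirects a read from the intended address to another one; the memory size, fault count, and function names are illustrative only and are not parameters of the tested modules.

```python
import random

def mismatches(memory, redirected_reads):
    """Count reads that return wrong data when address `want` is silently
    decoded as address `got` (a hypothetical addressing fault)."""
    return sum(memory[want] != memory[got] for want, got in redirected_reads)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
N_WORDS = 4096           # toy memory size, not a DUT parameter

# 500 hypothetical faults: (intended address, address actually read).
faults = []
while len(faults) < 500:
    want, got = rng.randrange(N_WORDS), rng.randrange(N_WORDS)
    if want != got:
        faults.append((want, got))

all_ones = [0xFF] * N_WORDS
random_pattern = [rng.randrange(256) for _ in range(N_WORDS)]

masked = mismatches(all_ones, faults)         # homogeneous data hides every fault
exposed = mismatches(random_pattern, faults)  # random data exposes most of them
print(masked, exposed)
```

With a homogeneous pattern every redirected read returns the expected value, so no fault is visible; with random byte values a redirected read matches only by chance.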

## IV. EXPERIMENTAL RESULTS FOR SERVER-GRADE DDR4 AND DDR5 MODULES

## A. SEEs

Fig. 8 presents $\sigma_{\text{SEE}}$ for SG-DDR5-K and SG-DDR4-K with the external ECC both enabled and disabled. When ECC is disabled, proton-induced SEEs in both modules predominantly result in DRAM\_SEEs, with only a minimal occurrence of sys\_SEFIs. Therefore, $\sigma_{\text{total}}$ is primarily composed of DRAM\_SEEs. Notably, DDR5 demonstrates a marginally lower $\sigma_{\text{DRAM\_SEE}}$ compared to DDR4, likely due to its on-die ECC capability to correct proton-induced SBUs. Conversely, with ECC enabled, no DRAM\_SEEs are observed in either module, but there is a notable increase in sys\_SEFIs, especially in DDR5. Consequently, $\sigma_{\text{total}}$ is equivalent to $\sigma_{\text{sys\_SEFI}}$ in scenarios where ECC is enabled.

Indeed, the presence of external ECC acts as an additional primary mechanism for SEFIs. External ECC requires coordination between the DDR module and the CPU. For server-grade DDR4 and DDR5 modules, each 64-bit block of original data is protected by eight external ECC check bits.



Fig. 8.  $\sigma_{\text{SEE}}$  for SG-DDR4-K and SG-DDR5-K with external-ECC disabled and enabled. Proton energy is 75.88 MeV.

During data writing, these check bits are generated by the CPU and stored in the DRAM chips alongside the original data. During retrieval, the CPU simultaneously reads and decodes both the original data and the ECC check bits [26]. The single error correction and double error detection (SECDED) algorithm, a common form of external ECC, is capable of correcting any SBUs and detecting up to 2-bit upsets [14], [27]. Therefore, SECDED can correct any SBUs within the 64-bit data block. However, SECDED is unable to fully correct MBUs. If the CPU detects an error that cannot be corrected, it can lead to SEFIs, such as kernel panics or blue screen errors [14]. Consequently, MBUs that might pass in the ECC-disabled scenario become uncorrectable errors in the ECC-enabled scenario, resulting in sys\_SEFIs.
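As a rough illustration of the SECDED behavior described above, the following sketch uses a textbook extended Hamming (72,64) layout: check bits at power-of-two positions plus one overall parity bit at position 0. This layout and the function names are assumptions for illustration; actual memory controllers may use a different code.

```python
import random

N_POS = 72                                           # position 0 = overall parity bit
CHECK = [1 << i for i in range(7)]                   # Hamming check positions 1..64
DATA = [p for p in range(1, N_POS) if p & (p - 1)]   # the 64 remaining positions

def syndrome(word):
    """XOR of the indices of all set bits (position 0 contributes nothing)."""
    s = 0
    for pos, bit in enumerate(word):
        if bit:
            s ^= pos
    return s

def encode(data_bits):
    word = [0] * N_POS
    for pos, bit in zip(DATA, data_bits):
        word[pos] = bit
    s = syndrome(word)
    for i, pos in enumerate(CHECK):
        word[pos] = (s >> i) & 1                     # force the syndrome to zero
    word[0] = sum(word) % 2                          # make the overall parity even
    return word

def decode(word):
    """Return (status, data_bits); data_bits is None when uncorrectable."""
    word = list(word)
    s, parity = syndrome(word), sum(word) % 2
    if parity:                                       # odd number of flipped bits
        if s >= N_POS:
            return "uncorrectable", None
        word[s] ^= 1                                 # s == 0 is the parity bit itself
        return "corrected", [word[p] for p in DATA]
    if s:                                            # even number of flips, nonzero syndrome
        return "uncorrectable", None
    return "clean", [word[p] for p in DATA]

rng = random.Random(1)
data = [rng.randint(0, 1) for _ in range(64)]
word = encode(data)
one = list(word); one[10] ^= 1                       # single-bit upset
two = list(word); two[10] ^= 1; two[33] ^= 1         # double-bit upset
print(decode(one)[0], decode(two)[0])
```

In this sketch a single flipped bit is located by the syndrome and repaired, while any two flipped bits leave the overall parity even with a nonzero syndrome, which the decoder can only report as uncorrectable.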

The significant discrepancy of  $\sigma_{sys\_SEFI}$  between DDR4 and DDR5 in the enabled ECC case could be interpreted through their SEU patterns. Table V counts the DRAM\_SEE error patterns of SG-DDR4-K and SG-DDR5-K in the disabled ECC case. It can be found that there are no MBUs for DDR4, while MBUs occupy 75% of all SEUs for DDR5. This is a conspicuous discrepancy between DDR4 and DDR5. These MBUs from DDR5 could still be uncorrectable in the enabled ECC case, primarily contributing to the dramatic increase in  $\sigma_{sys\_SEFI}$ . In particular, we found almost all SEUs for the two modules are CUs, as presented in Table IV. Thus, decoders and peripheral circuits are implied to be the main sensitive regions, which also results in the discrepant  $\sigma$  between SG-DDR4-K

TABLE V DRAM\_SEE ERROR PATTERNS FOR SERVER-GRADE DDR4 AND DDR5 MODULES WITH EXTERNAL ECC DISABLED

| Module | Valid fluence (cm$^{-2}$) | # of DRAM_SEEs | Ratio of single SBUs to all (single + continuous) SBUs |
|--------|---------------------------|----------------|--------------------------------------------------------|
| DDR4   | $3.12 \times 10^{11}$     | 40             | 100%                                                   |
| DDR5   | $1.70 \times 10^{11}$     | 12             | 25%                                                    |

and SG-DDR5-K. This result suggests that the SG-DDR5-K still requires additional hardening designs.

The SG-DDR4-K $\sigma_{\text{sys\_SEFI}}$ values in the two cases with ECC and without ECC are $(3.00 \pm 2.12) \times 10^{-12}\ \text{cm}^{2}$ (sys\_SEFI count is 2) and $(8.60 \pm 4.30) \times 10^{-12}\ \text{cm}^{2}$ (sys\_SEFI count is 4) in Fig. 8, respectively, with overlapping error bars. Since no MBUs were observed for SG-DDR4-K in the disabled ECC case, most SEUs are expected to be corrected in the enabled ECC case, so sys\_SEFIs due to SECDED failures should not arise, and similar $\sigma_{\text{sys\_SEFI}}$ values are expected.
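The quoted uncertainties appear consistent with a Poisson counting error of $\sqrt{N}$ on the event count $N$; this is an inference from the numbers, not a statement made in this section. The short check below uses only the values quoted above, and the helper `cross_section` is a hypothetical name.

```python
import math

def cross_section(count, fluence):
    """Return (sigma, one-sigma error) in cm^2 for `count` events observed
    at `fluence` in cm^-2, assuming a sqrt(N) counting error."""
    return count / fluence, math.sqrt(count) / fluence

# Quoted SG-DDR4-K values: (sys_SEFI count, sigma, error), both in cm^2.
quoted = [(2, 3.00e-12, 2.12e-12), (4, 8.60e-12, 4.30e-12)]
for count, sigma, err in quoted:
    # Under the sqrt(N) assumption, err / sigma equals 1 / sqrt(count).
    print(count, round(err / sigma, 3), round(1 / math.sqrt(count), 3))

# Two one-sigma intervals overlap when the top of the lower interval
# exceeds the bottom of the upper interval.
overlap = quoted[0][1] + quoted[0][2] > quoted[1][1] - quoted[1][2]
print(overlap)
```

With so few events the relative uncertainty is 50% or more, which is why the two values cannot be distinguished statistically.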

In addition, the introduction of on-die ECC might be a secondary reason for the increased $\sigma_{sys\_SEFI}$ of DDR5 in the enabled ECC case. Unlike the external ECC, every 128 bits of data for on-die ECC are derived from a 1-B width across a burst length (BL = 16, from BL0 to BL15). Thus, the correction mechanism of on-die ECC spans 16 B along the burst sequence. If an MBU spans different bytes (e.g., an SBU in BL0 and an SBU in BL1), the on-die ECC may decode incorrectly and introduce an extra upset bit [2]. Such examples will be given by fault injections in Section V. Thus, a new MBU could appear within a byte (e.g., the SBU in BL0 becomes an MBU), which is uncorrectable in the enabled external ECC case, probably resulting in sys\_SEFIs.

Furthermore, the comparison of Figs. 2 and 8 reveals differences in $\sigma_{SEE}$, especially $\sigma_{sys\_SEFI}$, between CG-DDR5-K and SG-DDR5-K (disabled ECC): the server-grade module demonstrates a lower sensitivity to protons. If the basic dies comprising the CG-DDR5-K and SG-DDR5-K modules are indeed the same, this is a notable phenomenon, and the underlying reasons need to be further investigated.

## B. Accumulated Effects

Fig. 9(a) and (b) presents the outcomes of testing for accumulated effects on SG-DDR4-K and SG-DDR5-K, respectively. Both types of modules show sensitivity to accumulated irradiation, evidenced by the presence of stuck bits and weak bits (defined in Section II-B2). Given that the mechanisms behind stuck bits and weak bits have been extensively explored (e.g., [10], [13]), this discussion will not cover those details. Instead, our analysis is centered on identifying and understanding the disparities between DDR4 and DDR5 modules in response to accumulated proton fluence. Notably, DDR4 demonstrates a significantly higher sensitivity than DDR5 as the accumulated proton fluence increases. At similar fluence levels, the number of failed bits in DDR4 is far larger than in DDR5, across both Tests 5 and 13.

The on-die ECC of DDR5 modules plays a crucial role in this difference. For convenience, we define each 128-bit



Fig. 9. Number of failed bits per chip (2 Gb) versus the proton fluence for DDR4 and DDR5 modules. Proton energy was 75.88 MeV. Both Tests 5 and 13 were performed after irradiation in the ECC-disabled case. (a) SG-DDR4-K. (b) SG-DDR5-K.

data block along with its eight on-die ECC check bits as one on-die ECC segment. Thus, if stuck (or weak) bits are randomly distributed across the 4-Gb data and its on-die ECC check of 0.25 Gb, the probability of two failed bits occurring simultaneously within the same on-die ECC segment can be determined by the following equation:

$$\operatorname{PROB}_{k}(n, p)(\text{Collision}) \approx 1 - e^{-\frac{p(p-1)(2k-1)}{2n}} \tag{5}$$

where

1) $n$ = (size of total population) = 4.25 Gb;
2) $k$ = (collision range, assumed as half of an on-die ECC segment) = $0.5 \times (128 + 8) = 68$;
3) $p$ = (number of random memory upsets) = 2.

Substituting these values yields an extremely small exponent ($e^{-3.18 \times 10^{-8}}$), i.e., a collision probability of about $3.18 \times 10^{-8}$, indicating that this failed pattern is rare. The most frequent pattern is SBUs from different on-die ECC segments, which can be successfully corrected by on-die ECC. Therefore, the number of detected failed bits for DDR5 is quite low in both tests.
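The substitution into (5) can be reproduced numerically. The sketch below assumes 1 Gb is taken as $10^{9}$ bits, which is the interpretation that reproduces the quoted exponent; the function name is illustrative.

```python
import math

def collision_probability(n, k, p):
    """Evaluate the approximation in Eq. (5)."""
    exponent = p * (p - 1) * (2 * k - 1) / (2 * n)
    # -expm1(-x) equals 1 - exp(-x) but keeps precision for very small x.
    return -math.expm1(-exponent)

n = 4.25e9             # 4.25 Gb, taking 1 Gb as 1e9 bits (assumption)
k = 0.5 * (128 + 8)    # half of one on-die ECC segment = 68
p = 2                  # two random failed bits

exponent = p * (p - 1) * (2 * k - 1) / (2 * n)
prob = collision_probability(n, k, p)
print(f"exponent = {exponent:.3g}, probability = {prob:.3g}")
```

Because the exponent is so small, $1 - e^{-x} \approx x$, so the collision probability is essentially equal to the exponent itself.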

For instance, as illustrated in Fig. 9(a), at a fluence of $3.25 \times 10^{12}$ p/cm<sup>2</sup>, DDR4 showed a total of 8386 stuck bits and 3045 weak bits for two chips in experiments. Utilizing the failed bit count from DDR4, we performed a simple fault injection simulation, hypothesizing that 10000 failed bits randomly appear within a 4.25-Gb range. This simulation with 100 trials shows that MBUs within a single on-die ECC segment occur at most nine times. The implementation of on-die ECC is therefore projected to reduce the number of failed bits by more than three orders of magnitude. It should be noted that if the weak or stuck cells do not appear randomly but adjacently, especially in an on-die or external ECC segment, they represent uncorrectable errors for the popular SECDED algorithm and probably introduce an extra upset bit (as discussed in Section V) and even a sys\_SEFI [14].
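A simulation of this kind can be sketched as follows. This is not the authors' code: it assumes 4.25 Gb means $4.25 \times 10^{9}$ bits, that consecutive runs of 136 bits form one on-die ECC segment, and that failed bits are drawn uniformly without replacement; the seed and names are arbitrary.

```python
import random
from collections import Counter

SEGMENT_BITS = 128 + 8          # one on-die ECC segment (data + check bits)
TOTAL_BITS = 4_250_000_000      # 4.25 Gb taken as 4.25e9 bits (assumption)
N_FAILED = 10_000               # hypothesized number of failed bits
N_TRIALS = 100

def segments_with_mbu(rng):
    """Place N_FAILED distinct failed bits at random and count the
    segments that end up holding two or more of them."""
    bits = rng.sample(range(TOTAL_BITS), N_FAILED)
    per_segment = Counter(b // SEGMENT_BITS for b in bits)
    return sum(1 for c in per_segment.values() if c >= 2)

rng = random.Random(2025)       # arbitrary seed for repeatability
counts = [segments_with_mbu(rng) for _ in range(N_TRIALS)]
print(max(counts), sum(counts) / N_TRIALS)
```

Under these assumptions the expected number of affected segments per trial is roughly $N^2 / (2 \times 3.125 \times 10^{7}) \approx 1.6$ for $N = 10000$, so only a handful of the failed bits would remain uncorrected by a single-error-correcting code.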

However, the experimental results demonstrated that the failed bit number did not exhibit three orders of magnitude difference between DDR4 and DDR5 (about two orders of magnitude) in Fig. 9. The disparity of the failed bit number between experimental results and simulation data may be linked to the increased sensitivity of DRAM cells in DDR5 compared to DDR4. As technology progresses from DDR4 to DDR5, the chip dimensions are reduced, which enhances speed and capacity [29]. Meanwhile, this shift to smaller process technology also increases the sensitivity and likelihood of disturbances within the DRAM cells [30]. For instance, continuous oxide scaling and metal pitch scaling cause the degradation of cell retention time [1]. Such elevated vulnerabilities are why the industry has integrated on-die ECC in DDR5 modules to mitigate these issues.

## V. FAULT INJECTIONS TO ON-DIE ECC SEGMENT

To further elucidate the mechanism behind the SBU and DCU patterns observed in DDR5, as detailed in Table IV, and considering the role of on-die ECC, this section performs fault injection simulations on a 128-bit data block accompanied by eight ECC check bits.

## A. On-Die ECC Structure

As previously mentioned, the internal on-die ECC in DDR5 utilizes each block of 128-bit data (data[0:127]) to generate its eight ECC check bits (CB[0:7]). DDR5 modules feature a prefetch mechanism, with the corresponding burst lines being 16. In the case of the x8 DDR5 devices utilized in our experiments, each 128-bit data block processed by the on-die ECC corresponds to one prefetch cycle, with a byte width of DQ[0:7], as illustrated in Fig. 10. The ECC check bit generation block is defined by an H matrix, which is also employed in the decoding process. However, the actual implementation of the H matrix in DRAM chips varies and remains confidential among manufacturers, with some employing Hamming codes [31].

For the simulations, a Hamming code was utilized to assess the prospective impact of on-die ECC failure on error patterns, rather than to replicate measurement results. The simulations employed the same implementation as the open-source *Hamming.py* file [27], which includes functions for ECC check bit generation [*encode* (*data*)] and error detection/correction [*decode* (*data* + *ECC*)]. The bit map for the Hamming

| Burst line (X8) | DQ0 | DQ1 | DQ2 | DQ3 | DQ4 | DQ5 | DQ6 | DQ7 |
|-----------------|-----|-----|-----|-----|-----|-----|-----|-----|
| BL0             | 0   | 8   | 16  | 24  | 32  | 40  | 48  | 56  |
| BL1             | 1   | 9   | 17  | 25  | 33  | 41  | 49  | 57  |
| BL2             | 2   | 10  | 18  | 26  | 34  | 42  | 50  | 58  |
| BL3             | 3   | 11  | 19  | 27  | 35  | 43  | 51  | 59  |
| BL4             | 4   | 12  | 20  | 28  | 36  | 44  | 52  | 60  |
| BL5             | 5   | 13  | 21  | 29  | 37  | 45  | 53  | 61  |
| BL6             | 6   | 14  | 22  | 30  | 38  | 46  | 54  | 62  |
| BL7             | 7   | 15  | 23  | 31  | 39  | 47  | 55  | 63  |
| BL8             | 64  | 72  | 80  | 88  | 96  | 104 | 112 | 120 |
| BL9             | 65  | 73  | 81  | 89  | 97  | 105 | 113 | 121 |
| BL10            | 66  | 74  | 82  | 90  | 98  | 106 | 114 | 122 |
| BL11            | 67  | 75  | 83  | 91  | 99  | 107 | 115 | 123 |
| BL12            | 68  | 76  | 84  | 92  | 100 | 108 | 116 | 124 |
| BL13            | 69  | 77  | 85  | 93  | 101 | 109 | 117 | 125 |
| BL14            | 70  | 78  | 86  | 94  | 102 | 110 | 118 | 126 |
| BL15            | 71  | 79  | 87  | 95  | 103 | 111 | 119 | 127 |

Fig. 10. X8 burst order and DQ map versus code word [2]. Note that BL[0:15] and DQ[0:7] are burst lines for the prefetch and data pins.

|     |     |     |     |     |     |     |     |     |     |     |     |     |     |     |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| Na  | CB0 | CB1 | 0   | CB2 | 8   | 16  | 24  | CB3 | 32  | 40  | 48  | 56  | 1   | 9   |
| CB4 | 17  | 25  | 33  | 41  | 49  | 57  | 2   | 10  | 18  | 26  | 34  | 42  | 50  | 58  |
| CB5 | 3   | 11  | 19  | 27  | 35  | 43  | 51  | 59  | 4   | 12  | 20  | 28  | 36  | 44  |
| 52  | 60  | 5   | 13  | 21  | 29  | 37  | 45  | 53  | 61  | 6   | 14  | 22  | 30  | 38  |
| CB6 | 46  | 54  | 62  | 7   | 15  | 23  | 31  | 39  | 47  | 55  | 63  | 64  | 72  | 80  |
| 88  | 96  | 104 | 112 | 120 | 65  | 73  | 81  | 89  | 97  | 105 | 113 | 121 | 66  | 74  |
| 82  | 90  | 98  | 106 | 114 | 122 | 67  | 75  | 83  | 91  | 99  | 107 | 115 | 123 | 68  |
| 76  | 84  | 92  | 100 | 108 | 116 | 124 | 69  | 77  | 85  | 93  | 101 | 109 | 117 | 125 |
| CB7 | 70  | 78  | 86  | 94  | 102 | 110 | 118 | 126 | 71  | 79  | 87  | 95  | 103 | 111 |
| 119 | 127 | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  |
| Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  |
| Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  |
| Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  |
| Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  |
| Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  | Na  |

Legend: CB0-CB7, eight ECC check bits; 0-127, 128 data bits; Na, no data.

Fig. 11. Bit map of the Hamming code for the *hamming.py* script in [27]. The 16 B from BL0 to BL15 in Fig. 10 are allocated sequentially. The "Na" in purple at the top-left corner is used to check the whole table, while this bit is empty for the on-die ECC. Eight ECC check bits are CB0–CB7.

code used is depicted in Fig. 11. Here, the 128-bit data from Fig. 10 and the eight ECC check bits are sequentially arranged. Notably, the eight ECC check bits, while redundant for safeguarding the 128-bit data, can protect up to 216 bits of data. This includes the 128-bit data and an 88-bit "Na" region, excluding the single bit at the top left corner. The 88-bit "Na" region falls within the coverage area of the eight-bit ECC but remains unused and does not influence data correction. Here, the "Na" bit at the top left corner of Fig. 11 typically serves to verify the entire table in the SECDED scenario; however, it is not utilized in the on-die ECC configuration. Without this bit, the *decode (data* + *ECC)* function is capable of detecting and correcting up to one error within the total 136 (128 + 8) bits.

## B. Fault Injection Flow

In the fault injection simulations, we generate 128-bit data blocks with random patterns for each iteration. These 128 bits are placed in the specific positions outlined in Fig. 10, composing a prefetch *data* block (16 B). The eight ECC check bits are then generated using the *encode* (*data*) function. We focus solely on 2-bit upset scenarios within the total 136 bits (128 data bits + 8 ECC bits). Following fault injection, the 136-bit block is processed using the *decode* (*data* + *ECC*) function. Finally, the 16-B decoded data block is compared to

TABLE VI Results of Injecting Two Random (or Adjacent) Upsets in the 128-bit Data

| Injection cases                | 2\*SBU | MBU(2) | 3\*SBU | MBU(2)&SBU | MBU(3) |
|--------------------------------|--------|--------|--------|------------|--------|
| Random case                    | 13.76% | 1.66%  | 74.04% | 10.27%     | 0.23%  |
| Adjacent case $(\Delta=1)^{a}$ | 47.63% | 0      | 45.32% | 7.05%      | 0      |

<sup>a</sup>Adjacent case assumes that the adjacent number of bit cells in Fig. 10 are physically adjacent (e.g., Data[0] and Data[1] corresponding to DQ0 at BL0 and BL1).

the original to identify and count the types of errors in each byte.

To delve deeper into how the positions of upsets influence the types of errors, four fault injection scenarios are designed. The first scenario involves injecting two random (or adjacent) upsets solely into the 128-bit data block. The second scenario introduces a combination of one stuck upset and one random upset within the 128-bit data. In the third scenario, two random (or adjacent) upsets are targeted exclusively at the eight ECC check bits. The last scenario introduces one random upset within both the 128-bit data and the eight ECC check bits.
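The flow for the first scenario can be sketched as below. This is not the *hamming.py* script from [27]: it assumes a textbook single-error-correcting Hamming layout with check bits at positions 1, 2, 4, ..., 128 and the 16 B serialized in order (BL0 first), so data slot `j` belongs to byte `j // 8`. The percentages it prints should therefore not be expected to reproduce the tables in this section.

```python
import random
from collections import Counter

CHECK_POS = [1 << i for i in range(8)]                  # 1, 2, 4, ..., 128
DATA_POS = [p for p in range(1, 137) if p & (p - 1)]    # 128 data positions

def syndrome(word):
    """XOR of the positions of all set bits in the codeword."""
    s = 0
    for pos, bit in word.items():
        if bit:
            s ^= pos
    return s

def encode(data):
    """Map 128 data bits to a {position: bit} codeword of 136 bits."""
    word = dict(zip(DATA_POS, data))
    s = syndrome(word)
    for i, pos in enumerate(CHECK_POS):
        word[pos] = (s >> i) & 1
    return word

def decode(word):
    """Flip the bit named by the syndrome, if that position is stored."""
    fixed = dict(word)
    s = syndrome(fixed)
    if s in fixed:              # s == 0 or s > 136 (unused region) changes nothing
        fixed[s] ^= 1
    return [fixed[p] for p in DATA_POS]

NAMES = {(): "no error", (1,): "SBU", (1, 1): "2*SBU", (2,): "MBU(2)",
         (1, 1, 1): "3*SBU", (1, 2): "MBU(2)&SBU", (3,): "MBU(3)"}

def classify(original, decoded):
    """Name the error pattern from the number of wrong bits in each byte."""
    per_byte = Counter(j // 8 for j in range(128) if original[j] != decoded[j])
    return NAMES.get(tuple(sorted(per_byte.values())), "other")

def inject_two_in_data(trials, rng):
    tally = Counter()
    for _ in range(trials):
        data = [rng.randint(0, 1) for _ in range(128)]
        word = encode(data)
        for pos in rng.sample(DATA_POS, 2):   # two random upsets in the data
            word[pos] ^= 1
        tally[classify(data, decode(word))] += 1
    return tally

tally = inject_two_in_data(10_000, random.Random(0))
print(tally)
```

The other scenarios follow by changing which positions are flipped: drawing both from `CHECK_POS`, one from each list, or fixing one data position as the stuck bit.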

## C. Simulation Results

1) Two Random (or Adjacent) Upsets in the 128-bit Data: Table VI presents a summary of the error types resulting from 2-bit upset injections into the 128-bit data. The total number of injections was  $1 \times 10^5$  for both random and adjacent cases. Across the numerous random injections, five distinct error types across the 16 B were identified.

- *2\*SBU:* Among the 16 B, two experience an SBU, corresponding to the DCU in Table IV.
- *MBU(2):* Only 1 B experiences an MBU with a 2-bit upset.
- *3\*SBU:* 3 of 16 B experience an SBU.
- *MBU(2)&SBU:* 1 B experiences an MBU(2), while another experiences an SBU.
- *MBU(3):* 1 B experiences an MBU with a 3-bit upset.

In practice, the likelihood of two independent random upsets occurring is expected to be very low, as upsets tend to happen as physically adjacent bit upsets along the energetic particle track. Meanwhile, the information about the physical bit locations is not available. Therefore, we made a simple assumption that bit cells with adjacent numbers in Fig. 10, such as data[0] and data[1] corresponding to DQ0 at BL0 and BL1, are physically adjacent. Adjacent upset injections were then conducted accordingly. Here, we do not insist that this assumption is valid in actual chips. It is also important to acknowledge that, under this assumption, adjacent injections treat bits in different BLs, i.e., in different bytes, as physically adjacent. Therefore, a physically adjacent 2-bit upset is originally represented as two SBUs from adjacent bytes.

Table VI shows how error types and distributions vary depending on the fault injection scenarios. In the case of adjacent injections, only combinations of 2\*SBU, 3\*SBU, and MBU(2)&SBU occur. Notably, 2\*SBU has the highest probability of 47.63%. The 2\*SBU in the simulations

TABLE VII Results of Injecting One Stuck Upset and One Random Upset in the 128-bit Data

| Error patterns and counts of stuck-bit locations for each error pattern | 2\*SBU | MBU(2) | 3\*SBU | MBU(2)&SBU | MBU(3)      |
|-------------------------------------------------------------------------|--------|--------|--------|------------|-------------|
| type 1, 4/128                                                           | 4.17%  | 1.66%  | 12.49% | 78.72%     | 2.96%       |
| type 2, 8/128                                                           | 90.25% | 1.58%  | 8.17%  | 0          | 0           |
| type 3, 116/128                                                         | 9.59%  | 2.34%  | 80.14% | 8.29%      | $\approx 0$ |

resembles the DCUs observed in irradiation experiments in Table IV. For 2\*SBU cases, the two SBUs are those originally injected, indicating that the on-die ECC did not intervene. This phenomenon can be understood by examining the bit map in Fig. 11. While SBUs are typically detectable and correctable, a double-bit upset might cause the Hamming code to incorrectly identify the error address, potentially leading to a third-bit error [2], [30]. Besides, in Fig. 11, the eight ECC check bits are designated to protect only the 128-bit data block, leaving the remaining 88 bits as non-data areas (Na regions). Therefore, occurrences of 3\*SBU or MBU(2)&SBU are observed only if the third-bit error falls within the 128-bit data region. If the third-bit error is mapped to either the ECC or Na region, it does not affect the 128-bit data, corresponding to 2\*SBU.

In addition, in the adjacent case, a 3\*SBU pattern exhibits a significant likelihood of 45.32%, as indicated in Table VI. However, this pattern was not identified in experimental observations, which might be due to three factors. First, the physical adjacency assumption may not be valid. Second, only a limited number of errors were detected (for instance, only three DCUs are noted in Table IV), suggesting that the 3\*SBU pattern might not have occurred yet. Third, the simulations utilized a publicly available Hamming code, which may perform differently from the actual H matrix utilized in the DUTs. Furthermore, random injection results indicate that 2-bit upsets within the 128-bit data do not result in a single SBU occurrence. Nonetheless, such incidents were observed in experiments. This discrepancy leads us to speculate that factors other than the 2-bit upsets are responsible for the observed SBUs.

2) One Stuck Bit and One Random Upset in the 128-bit Data: DDR5 modules may exhibit a greater susceptibility to stuck bits, as explored in Section IV-B. The presence of a stuck bit within the cells could act in conjunction with other SBUs, effectively creating a scenario akin to a double-bit upset. Table VII presents a summary of the error types resulting from the introduction of one stuck bit and one random upset. This fault injection is similar to the approach used for two random upsets; however, in this instance, each bit undergoes evaluation as a potential stuck bit with spatial consideration. For each stuck bit, the number of random injections was determined to be  $1 \times 10^5$ . An examination of the 128 potential stuck bits yielded three distinct distributions of error types.

- *4/128:* Four stuck-bit locations can induce 2\*SBU, MBU(2), 3\*SBU, MBU(2)&SBU, and MBU(3) after injecting another random upset.

TABLE VIII INJECTIONS OF TWO RANDOM (OR ADJACENT) UPSETS IN EIGHT ECC CHECK BITS

| Injection cases            | SBU    | No error |
|----------------------------|--------|----------|
| Random                     | 89.24% | 10.76%   |
| Adjacent (e.g., CB1 & CB2) | 87.48% | 12.52%   |

TABLE IX INJECTIONS OF ONE RANDOM UPSET IN THE 128-bit DATA AND ONE RANDOM UPSET IN THE EIGHT ECC CHECK BITS

| Injection cases | SBU    | 2*SBU  | MBU(2) |
|-----------------|--------|--------|--------|
| Random          | 18.75% | 50.26% | 30.99% |

- *8/128:* Eight stuck-bit locations can induce 2\*SBU, MBU(2), and 3\*SBU after injecting another random upset.
- *116/128:* 116 stuck-bit locations can induce 2\*SBU, MBU(2), 3\*SBU, MBU(2)&SBU, and a few instances of MBU(3) after injecting another random upset.

The result suggests that when a stuck bit arises in the 128-bit data, it results in multiple error patterns after another random upset is injected. The distribution of error patterns varies with the location of the stuck bit, indicating that the position of the stuck bit plays a crucial role in the type of errors observed in the word. For example, if the stuck bit occurs in one of the 8 bits categorized as type 2, the likelihood of generating a DCU, represented as 2\*SBU, increases significantly when combined with another random upset. This probability can reach as high as 90.25%.

3) Two Random (or Adjacent) Upsets in ECC Check Bits: Radiation-induced upsets can also occur in the eight ECC check bits. Table VIII summarizes the results after injecting a 2-bit upset into the eight ECC check bits, considering both random and adjacent cases, where adjacency is defined by the ECC bit number. The results for random and adjacent cases are similar. SBU is the only error type observed, with corresponding rates of 89.24% and 87.48%, respectively. A 2-bit upset in the ECC region can cause the error address to be misinterpreted: if the address is mislocated in the 128-bit data region, one SBU results; if it falls in other regions, no error results. Consequently, SBU is the only error pattern observed. Furthermore, the likelihood of misinterpreted addresses occurring in the 128-bit data region is significantly higher, which accounts for the high percentage of SBU errors and is determined by the Hamming code.
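The random case can be enumerated exhaustively for a textbook Hamming layout. The sketch below assumes check bits at positions 1, 2, 4, ..., 128 and a stored codeword occupying positions 1 to 136; whether the script in [27] uses exactly this layout is not confirmed here.

```python
from itertools import combinations

CHECK_POS = [1 << i for i in range(8)]   # assumed check-bit positions 1, 2, ..., 128
LAST_POS = 136                           # 128 data + 8 check bits fill positions 1..136

outcomes = {"SBU": 0, "No error": 0}
for a, b in combinations(CHECK_POS, 2):
    s = a ^ b                            # syndrome after flipping both check bits
    # Two distinct powers of two never XOR to a power of two, so `s` names
    # either a data position (a good data bit gets flipped) or an unused one.
    outcomes["SBU" if s <= LAST_POS else "No error"] += 1

total = sum(outcomes.values())
print({name: f"{n}/{total}" for name, n in outcomes.items()})
```

Under this assumed layout, 25 of the 28 check-bit pairs (about 89.3%) point into the data region, which is close to the random-case rate in Table VIII, although that agreement alone does not establish that the same layout was used.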

4) One Random Upset in the 128-bit Data and One Random Upset in ECC Check Bits: Table IX summarizes the results after randomly injecting a 1-bit upset into both the 128-bit data and the ECC check bits. In this scenario, there are three error types: SBU (18.75%), 2\*SBU (50.26%), and MBU(2) (30.99%). However, the probability of this injection scenario occurring in a practical situation is thought to be low for two reasons. First, the ECC bits account for a lower percentage than the data; thus, the probability of upsets simultaneously occurring in both is low. More importantly, upsets are usually physically adjacent, whereas ECC bits and data may be stored separately inside the chip.

TABLE X SUMMARY OF DIFFERENCES IN PROTON-INDUCED RADIATION EFFECTS BETWEEN DDR4 AND DDR5 (DATA FROM CG-DDR5-A, CG-DDR5-K, SG-DDR5-K, AND SG-DDR4-K)

| Category                              | Main differences                                                                              |
|---------------------------------------|-----------------------------------------------------------------------------------------------|
| DRAM_SEE patterns                     | 1. Only DDR5 presents DCUs, MBUs. 2. DDR5 also presents a few SBUs even with on-die ECC        |
| $\sigma_{\text{DRAM\_SEE}}$           | DDR5 presents lower $\sigma$ (only for server-grade modules)                                  |
| sys_SEFIs with external ECC           | DDR5 presents a higher $\sigma$                                                               |
| sys_SEFIs without external ECC        | No distinguishable difference                                                                 |
| Tolerance of accumulated irradiations | DDR5 presents much better tolerance                                                           |

As shown in Table IV, SBU errors were detected in the experiments. Based on the above fault injection simulations, SBUs occur when upsets are introduced in the ECC region, but not when upsets are confined to the 128-bit data. This implies that the experimental detection of SBUs is likely linked to the on-die ECC check bits.

## VI. SUMMARY OF THE RESULT AND FUTURE WORK

Table X summarizes the main differences in proton-induced radiation effects between DDR4 and DDR5. DDR5 presents different DRAM\_SEE error patterns. CU is the main DRAM\_SEE error type for DDR5. In particular, only DDR5 presents DCUs and MBUs. It still has a few SBUs even with the on-die ECC. DDR5 presents a higher $\sigma_{\text{sys\_SEFI}}$ in the enabled external ECC case and a much better tolerance to accumulated irradiations.

On-die ECC is a new feature of DDR5. By combining the experimental results with simulations, it is evident that proton-induced MBUs originate from the decoder for the on-die ECC, where the input with multiple SBUs can be transformed into an MBU. This result suggests that optimizing the physical layout of cell arrays and the H matrix could greatly reduce the probability of MBU occurrences after decoding, like the adjacent case in Table VI. Furthermore, remaining multiple SBUs in different bytes can be fully corrected by introducing the external ECC, enhancing the overall reliability of modules. Conversely, different layouts of cell arrays or the H matrix could change the likelihood of MBU in a single byte, like the random case in Table VI. Those MBUs cannot be corrected by the external ECC, potentially leading to severe SEFIs. Therefore, careful consideration must be given to the layout of cell arrays and the H matrix.

In addition, more attention should be paid to the decoders and peripheral circuits, as well as the PMU. Experimental results indicated that those circuits can trigger significant sys\_SEFIs and DRAM\_SEFIs. With regard to the server-grade modules, the fewer sys\_SEFIs in the disabled ECC case suggest that some hardening designs might have been implemented for the decoders or peripheral circuits. Server-grade modules may still need to consider the potential failures caused by the PMU. Further research is required to understand the failure mechanisms of the PMU and to explore radiation-hardened-by-design (RHBD) methods.

## VII. CONCLUSION

This work studied the proton-induced radiation effects on DDR5 modules, including both SEEs and accumulated radiation effects. The energy sweep results of consumer-grade modules showed that under backside irradiation, the $\sigma$ peak of both DRAM\_SEE and sys\_SEFIs occurred near 25 MeV. Two different vendors exhibited a similar sensitivity to protons. These modules suffered from two special types of SEUs, namely SBUs and DCUs, as well as a large number of CUs. Both the burstiness and the continuity of CUs indicate sensitivity in the decoders and peripheral circuits. The PMU also demonstrated sensitivity, leading to severe system SEFIs. The comparison between server-grade DDR4 and DDR5 showed that $\sigma_{\text{sys\_SEFI}}$ increased significantly in the enabled ECC case only for DDR5, which mainly relates to its higher likelihood of MBU patterns. Moreover, DDR5 presented much better resistance to accumulated radiation effects than DDR4 thanks to the on-die ECC. Finally, using a general Hamming code, the potential mechanisms of the observed SBUs and DCUs were revealed by fault injections. Simulation results suggested that the observed SBUs may be mainly caused by a 2-bit upset in the on-die ECC region, while DCUs may mainly originate from a physically adjacent 2-bit upset in the data protected by the on-die ECC.

## REFERENCES

- [1] S. Lee et al., "Development and product reliability characterization of advanced high speed 14nm DDR5 DRAM with on-die ECC," in *Proc. IEEE Int. Rel. Phys. Symp. (IRPS)*, Mar. 2023, pp. 1050–1053.
- [2] JESD79-5B. DDR5 SDRAM. Accessed: Nov. 15, 2023. [Online]. Available: https://www.jedec.org/standards-documents/docs/jesd79-5b
- [3] SAMSUNG. DDR5, DRAM, Samsung Semiconductor Global. Accessed: Nov. 15, 2023. [Online]. Available: https://semiconductor.samsung.com/dram/ddr/ddr5/
- [4] P. Xiao-Yan and Y. Qiang, "Overview of direct measurements of cosmic rays," *Chin. Astron. Astrophys.*, vol. 43, no. 3, pp. 327–341, Jul. 2019.
- [5] E. H. Cannon et al., "Heavy ion, high-energy, and low-energy proton SEE sensitivity of 90-nm RHBD SRAMs," *IEEE Trans. Nucl. Sci.*, vol. 57, no. 6, pp. 3493–3499, Dec. 2010.
- [6] A. Rodriguez et al., "Proton-induced single-event degradation in SDRAMs," *IEEE Trans. Nucl. Sci.*, vol. 63, no. 4, pp. 2115–2121, Aug. 2016.
- [7] C. Lim, K. Park, and S. Baeg, "Active precharge hammering to monitor displacement damage using high-energy protons in 3x-nm SDRAM," *IEEE Trans. Nucl. Sci.*, vol. 64, no. 2, pp. 859–866, Feb. 2017.
- [8] M. Herrmann, K. Grurmann, F. Gliem, H. Schmidt, and V. Ferlet-Cavrois, "In-situ TID test of 4-Gbit DDR3 SDRAM devices," in *Proc. IEEE Radiat. Effects Data Workshop (REDW)*, Jul. 2013, pp. 202–208.
- [9] S. M. Guertin and M. Amrbar, "Single event testing of SDRAM, DDR2 and DDR3 memories," in *Proc. IEEE Radiat. Effects Data Workshop (REDW)*, Jul. 2016, pp. 205–211.
- [10] C. Lim et al., "Study of proton radiation effect to row hammer fault in DDR4 SDRAMs," *Microelectron. Rel.*, vol. 80, pp. 85–90, Jan. 2018.

- [11] G. Bak et al., "Logic soft error study with 800-MHz DDR3 SDRAMs in 3× nm using proton and neutron beams," in *Proc. IEEE Int. Rel. Phys. Symp.*, Apr. 2015, pp. 931–935.
- [12] A. L. Bosser, P. Kohler, A. Rodriguez, and P.-X. Wang, "Impact of particle radiation and temperature on the retention time of DDR4 SDRAM cells," *IEEE Trans. Nucl. Sci.*, vol. 70, no. 8, pp. 1878–1884, Aug. 2023.
- [13] D. Söderström et al., "Technology dependence of stuck bits and single-event upsets in 110-, 72-, and 63-nm SDRAMs," *IEEE Trans. Nucl. Sci.*, vol. 70, no. 8, pp. 1861–1869, Aug. 2023.
- [14] PassMark Softw. Pty Ltd. MEMTEST86. Accessed: Nov. 15, 2023. [Online]. Available: https://www.memtest86.com/
- [15] S. Longofono, D. Kline, R. Melhem, and A. K. Jones, "Predicting and mitigating single-event upsets in DRAM using HOTH," *Microelectron. Rel.*, vol. 117, Feb. 2021, Art. no. 114024.
- [16] D. M. Fleetwood, "Total-ionizing-dose effects, border traps, and 1/f noise in emerging MOS technologies," *IEEE Trans. Nucl. Sci.*, vol. 67, no. 7, pp. 1216–1240, Jul. 2020.
- [17] C. Lim, H. S. Jeong, G. Bak, S. Baeg, S.-J. Wen, and R. Wong, "Stuck bits study in DDR3 SDRAMs using 45-MeV proton beam," *IEEE Trans. Nucl. Sci.*, vol. 62, no. 2, pp. 520–526, Apr. 2015.
- [18] Y. Kim et al., "Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors," *ACM SIGARCH Comput. Archit. News*, vol. 42, no. 3, pp. 361–372, Jun. 2014.
- [19] K. Park, C. Lim, D. Yun, and S. Baeg, "Experiments and root cause analysis for active-precharge hammering fault in DDR3 SDRAM under 3× nm technology," *Microelectron. Rel.*, vol. 57, pp. 39–46, Feb. 2016.
- [20] (2008). SRIM. Accessed: Nov. 15, 2023. [Online]. Available: http://www.srim.org/
- [21] M. Park et al., "Soft error study on DDR4 SDRAMs using a 480 MeV proton beam," in *Proc. IEEE Int. Rel. Phys. Symp. (IRPS)*, Apr. 2017, pp. 865–870.
- [22] R. Koga, J. George, and S. Bielat, "Single event effects sensitivity of DDR3 SDRAMs to protons and heavy ions," in *Proc. IEEE Radiat. Effects Data Workshop*, Jul. 2012, pp. 95–102.
- [23] S. Agostinelli et al., "GEANT4—A simulation toolkit," Nucl. Instrum. Methods Phys. Res. A, Accel. Spectrom. Detect. Assoc. Equip., vol. 506, no. 3, pp. 250–303, Jul. 2003.
- [24] P. Caron, C. Inguimbert, L. Artola, F. Bezerra, and R. Ecoffet, "New SEU modeling method for calibrating target system to multiple radiation particles," *IEEE Trans. Nucl. Sci.*, vol. 67, no. 1, pp. 44–49, Jan. 2020.
- [25] J. Choe. Industry-Leading DDR5 Technology: Micron Vs. Samsung Vs. SK Hynix. Accessed: Nov. 15, 2023. [Online]. Available: https://www.techinsights.com/blog/industry-leading-DDR5-technology
- [26] V. Sankaranarayanan. DDR5 SDRAM Error Correction Code (ECC) in DDR Memories. Accessed: Nov. 15, 2023. [Online]. Available: https://www.synopsys.com/designware-ip/technical-bulletin/errorcorrection-code-ddr.html
- [27] D. Carrano. Hamming Error Correction Codes (SECDED). Accessed: Nov. 15, 2023. [Online]. Available: https://github.com/dominiccarrano/hamming
- [28] H. J. Tausch, "Simplified birthday statistics and Hamming EDAC," *IEEE Trans. Nucl. Sci.*, vol. 56, no. 2, pp. 474–478, Apr. 2009.
- [29] Tech. Insights Inc. DRAM Memory Technology. Accessed: Nov. 15, 2023. [Online]. Available: https://www.techinsights.com/blog/memory/micron-1a-dram-technology
- [30] K. Criss et al., "Improving memory reliability by bounding DRAM faults: DDR5 improved reliability features," in *Proc. Int. Symp. Memory Syst.*, Sep. 2020, pp. 317–322.
- [31] M. Patel, J. S. Kim, H. Hassan, and O. Mutlu, "Understanding and modeling on-die error correction in modern DRAM: An experimental study using real devices," in *Proc. 49th Annu. IEEE/IFIP Int. Conf. Dependable Syst. Netw. (DSN)*, Jun. 2019, pp. 13–25.