Neutron Induced Single Event in Electrical Devices and Components F - 1

# State-of-the-Art Study on Mitigation Techniques of Single Event Effects in Terrestrial Applications

Eishi Ibe<sup>\*1</sup>, Ken-ichi Shimbo<sup>1</sup>, Tadanobu Toba<sup>1</sup>, Hitoshi Taniguchi<sup>1</sup>, Takumi Uezono<sup>1</sup>, Koji Nishii<sup>2</sup>, and Yoshio Taniguchi<sup>3</sup>

<sup>1</sup>Yokohama Research Laboratory, Hitachi, Ltd., 292 Yoshida, Totsuka, Yokohama, Kanagawa, 244-0817 Japan

<sup>2</sup>Telecommunication & Network System Division, Hitachi, Ltd., Yokohama, Kanagawa, 244-8567 Japan

<sup>3</sup>Corporate Quality Assurance Division, Hitachi, Ltd., Chiyoda, Tokyo, 101-8608 Japan

\*E-mail: hidefumi.ibe.hf@hitachi.com

Keywords: terrestrial radiation, fault, soft-error, failure, network, stack layer, mitigation, DOUB, LABIR, TMR, MCBI, bipolar action

#### Abstract

As semiconductor device scaling continues far below the 100 nm design rule, terrestrial neutron-induced soft-error, typically in CMOS devices, is predicted to worsen further. Moreover, novel failure modes that may be more serious than memory soft-error have recently been reported. The need to implement mitigation techniques at the design phase is therefore growing rapidly, together with the need for advanced detection and quantification techniques. The most advanced of these techniques are reviewed and discussed.

## I. INTRODUCTION

Scaling down of semiconductor devices to sub-100 nm technology encounters a wide variety of technical challenges such as  $V_{th}$  variation [1], Negative Bias Temperature Instability (NBTI) [2], short-channel effects [3], gate leakage [4] and so on. Terrestrial neutron-induced single event upset (SEU) is one such key issue and can be a major setback in scaling. As scaling proceeds below 130 nm, a number of new error modes have been found to emerge. Such errors, in principle, originate from faults, that is, charges produced in the dual- or triple-well regions of CMOS devices. A fault does not always cause an error; this depends mainly on the location and the amount of charge collected at an active node. Similarly, an error does not always cause a system failure; this depends on a number of masking effects in the stack layers of manufacturing processes, as illustrated in Fig. 1. Some failures may be fatal when they take place in a *real-time system* such as an avionics control system or an automotive anti-lock brake system [5]. Other failures, as in entertainment applications, do not necessarily need to be taken care of. Soft-Error Rate (SER) has been regarded as one of the major reliability metrics of electronic devices and systems, but the fatality/significance of failures must also be considered in designing electronic systems, since a number of error modes now exist in them.

It has been generally accepted since the very beginning of the terrestrial neutron soft-error issue that mitigation techniques applied to only a single stack layer cannot be an effective and promising solution against system failures, and collaboration among stack layers has been encouraged [6,7]. In reality, such collaboration is very difficult, since most engineers and researchers cannot extend their expertise beyond their own stack layers. Novel strategies to overcome this situation need to be explored and are being proposed. A *built-in* communication scheme among the stack layers is proposed by Ibe *et al.* in their LABIR (inter-LAyer Built-In Reliability) concept [8]. Evans *et al.* propose RIIF (Reliability Information Interchange Format) as a common format or protocol to be used among stack layers in system design [9].

At present, a number of SEE (Single Event Effect) prediction, detection, prevention and recovery techniques have been proposed in each hierarchy or stack layer. These techniques are overviewed in the present paper to explore overall mitigation of terrestrial radiation-induced system failures in electronic systems.




In addition, it is widely recognized that high-energy neutrons are not the only source of terrestrial soft-error. Low-energy neutrons including thermal neutrons [10], protons [11], muons [12] and even electrons and photons could cause terrestrial soft-error, as they are substantially present in the terrestrial field as shown in Fig. 2 [13]. Novel strategies must cover this wider range of sources.

## II. FAULTY MODES IN EACH HIERARCHY

### 2.1 Fault modes

Figure 3 illustrates the basic CMOS well structure. N-wells and p-wells are aligned in a stripe pattern above the p-substrate, and memory and logic devices are manufactured on the same well structure. As typically shown in Fig. 4, charge collection takes place when a charged particle passes through a storage node, and bipolar action takes place when a charged particle passes through the p-n junction between the p-well and the n-well. These phenomena in the wells cause faults that may cause errors in memory cells. The fault modes are summarized in Table 1, including stuck-at faults and EMI (Electro-Magnetic Interference).


Table 1 Fault modes and their properties

| Class | Definition | Name | Characteristics | In-situ detection method | In-situ recovery/mitigation method |
|-------|------------|------|-----------------|--------------------------|------------------------------------|
| Transient/<br>noise | Transient in electric potential and/or current in a chip | SET<sup>1</sup> | Single transient due to charge collected at a diffusion layer in the chip. Pulse width is below a few nanoseconds and can last longer than two clock pulses. | Time and/or space redundancy such as DMR<sup>4</sup>, TMR<sup>5</sup> | None |
| | | MNT<sup>2</sup> | Simultaneous SETs in two or more diffusion layers. MNTs mainly take place in a single well due to charge sharing or bipolar action. Space redundancy techniques such as DICE<sup>6</sup> and TMR may not work against MNTs. | Monitoring the well potential and/or current | None |
| | | EMI<sup>3</sup> | Electromagnetic noise including burst noise | Electromagnetic probe | None |
| Defect | Lattice defects and/or trap levels in the oxides. They may cause leakage current and may disappear over time. | Vth shift | Cause of Vth shift in flash memory. They may cause stuck-at "0/1" errors, which can be permanent. | Vth measurement | Annealing may work |

1: Single Event Transient, 2: Multi-Node Transient, 3: Electro-Magnetic Interference, 4: Double Module Redundancy, 5: Triple Module Redundancy, 6: Dual Interlocked storage CEll


When a fault takes place in a logic part, it is called an SET (Single Event Transient), which can cause an SEU when it is captured in a memory element such as an FF (Flip-Flop). Simple methods such as parity on memory words are not effective for detecting SETs in logic circuits. Space redundancy techniques such as DMR (Double Module Redundancy) or TMR (Triple Module Redundancy) can be applied to detect SETs, but they carry power and area penalties. Moreover, if an MNT (Multi-Node Transient) takes place in the redundant nodes, the transient cannot be detected and may cause SDC (Silent Data Corruption).
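The difference between a single SET and an MNT in redundant logic can be illustrated with a minimal Python sketch. The bitwise majority voter, the two-copy comparator, the 8-bit `golden` value and the `set_mask` bit position are illustrative assumptions, not part of any design described in this paper.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant copies (TMR)."""
    return (a & b) | (b & c) | (a & c)


def dmr_mismatch(a: int, b: int) -> bool:
    """DMR comparison: flags a disagreement but cannot tell which copy is wrong."""
    return a != b


golden = 0b10110010    # hypothetical fault-free value latched by each copy
set_mask = 0b00001000  # hypothetical bit flipped by a captured transient

# Case 1: a single SET corrupts one copy only.
tmr_single = tmr_vote(golden ^ set_mask, golden, golden)
dmr_single = dmr_mismatch(golden ^ set_mask, golden)

# Case 2: an MNT corrupts the same bit in two redundant copies at once.
tmr_mnt = tmr_vote(golden ^ set_mask, golden ^ set_mask, golden)
dmr_mnt = dmr_mismatch(golden ^ set_mask, golden ^ set_mask)

print(f"single SET: TMR output correct = {tmr_single == golden}, DMR flag = {dmr_single}")
print(f"MNT:        TMR output correct = {tmr_mnt == golden}, DMR flag = {dmr_mnt}")
```

In the second case the voter outputs the corrupted value and the comparator raises no flag, which is the silent data corruption scenario described above.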



Fig. 3 Typical structure of CMOS dual/triple well and formation of a SRAM and an OR gate on the well



Fig. 4 Typical mechanisms of fault evolution

### 2.2 Error modes

Table 2 summarizes various error modes. Error modes can be classified roughly into three classes: soft-error or SEU (Single Event Upset), pseudo-hard error, and hard/permanent error. Soft-errors in memory elements include SBU (Single Bit Upset), MCU (Multi-Cell Upset), MBU (Multi-Bit Upset, an MCU in the same word) and MCBI (Multi-Coupled Bipolar Interaction) [14]. A direct hit on an FF by a charged particle may also cause an SEU. Soft-errors can be recovered by re-writing.

| Class | Definition | Mode name | Characteristics | In-situ detection | In-situ recovery method |
|-------|------------|-----------|-----------------|-------------------|-------------------------|
| SEU<sup>1</sup>, Soft-error | Data change in memory elements such as SRAMs and Flip-Flops<sup>11</sup> by a single particle hit (event). | SBU<sup>2</sup> | Single bit error for one event. | Parity, ECC<sup>16</sup> | ECC |
| | | MCU<sup>3</sup> | Two or more bits fail in one event. Data in multiple FFs may be flipped by an SET in the clock line or SET/RESET lines. | | Interleave + ECC |
| | | MBU<sup>4</sup> | MCU in the same word. They cannot be corrected by normal ECC. | Upper grade ECC | Upper grade ECC |
| | | MCBI<sup>5</sup> | Two or more bits fail locally due to potential disturbance in the well by bipolar action. | Current/potential monitor | Interleave + ECC |
| Pseudo hard-error, PCSE<sup>13</sup> | Error that cannot be recovered by re-writing. They can mostly be recovered by power cycling. | FBE<sup>6</sup> | Main error mode in SOI<sup>12</sup>. Body-tie may suppress this mode. | Parity, ECC | Power cycle |
| | | SEL<sup>7</sup> | Re-writing does not work. Current continues to flow due to the parasitic thyristor effect. Power cycling can be applied to recover the chip. | Current/potential monitor | Power cycle |
| | | SEFI<sup>8</sup> | All-in-one definition of functional abnormalities in logic circuits. Power cycling or resetting FFs can recover the chip. SEFI in a decoder in the peripheral circuit of a memory may cause a wrong address. | FF parity/ECC | Power cycle / FF reset |
| | | Firm Error | Error in the configuration memory of an SRAM-based FPGA<sup>14</sup>. | CRC<sup>17</sup> | Partial reconfiguration |
| Hard error/<br>Permanent error | Destructive and permanent error | SEGR<sup>9</sup> | Destruction of the gate oxide in power devices, mainly due to heavy ions. Flash memory may fail in this mode as scaling proceeds to the extreme. | Abnormalities in parts | Loading stand-by system |
| | | SEB<sup>10</sup> | Destructive mode in power devices such as power MOSFETs and IGBTs<sup>15</sup>. SEB may take place in IGBTs for trains and automobiles. | Abnormalities in parts | Loading stand-by system |

Table 2 Error modes of single event effects in semiconductor devices

1: Single Event Upset, 2: Single Bit Upset, 3: Multi-Cell Upset, 4: Multi-Bit Upset, 5: Multi-Coupled Bipolar Interaction, 6: Floating Body Effect, 7: Single Event Latchup, 8: Single Event Functional Interrupt, 9: Single Event Gate Rupture, 10: Single Event Burnout, 11: Flip-Flop, 12: Silicon On Insulator, 13: Power Cycle Soft Error, 14: Field Programmable Gate Array, 15: Insulated Gate Bipolar Transistor, 16: Error Correction Code, 17: Cyclic Redundancy Check

In particular, MCUs have been under close scrutiny, and their ratio to the total SEUs is increasing drastically [15-19]. Although MBUs can be avoided by a combination of ECC and the interleaving technique [19], MCUs, even those that can be corrected by EDAC/ECC, can still be problematic in high-performance devices such as content addressable memories (CAMs) [20] or registers used in network processors and routers. In system design, it is therefore very important to evaluate MCUs as well as the soft-error rates (SERs) of the device in the design phase.
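How interleaving combined with ECC turns a multi-cell upset into correctable single-bit errors can be sketched in Python as follows. This is an illustrative model only: the Hamming(7,4) single-error-correcting code, the four-word row, the interleaving distance of four and the cluster of three adjacent upset cells are all assumptions chosen for brevity, not parameters of any device discussed here.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (list index 0 = position 1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]


def hamming74_decode(code):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the suspected bad bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)


def cell_index(word, bit, n_words, interleaved):
    """Physical cell position of a codeword bit within one memory row."""
    return bit * n_words + word if interleaved else word * 7 + bit


def read_after_mcu(data, start, width, interleaved):
    """Store data, flip `width` physically adjacent cells, then read back through ECC."""
    n = len(data)
    row = [0] * (7 * n)
    for w, value in enumerate(data):
        for j, bit in enumerate(hamming74_encode(value)):
            row[cell_index(w, j, n, interleaved)] = bit
    for cell in range(start, start + width):  # the multi-cell upset
        row[cell] ^= 1
    return [
        hamming74_decode([row[cell_index(w, j, n, interleaved)] for j in range(7)])
        for w in range(n)
    ]


data = [0x3, 0xA, 0x5, 0xC]
with_interleave = read_after_mcu(data, start=9, width=3, interleaved=True)
without_interleave = read_after_mcu(data, start=9, width=3, interleaved=False)
print("interleaved row recovered:    ", with_interleave == data)
print("non-interleaved row recovered:", without_interleave == data)
```

With interleaving, the three upset cells fall into three different words and each word sees a single-bit error that the code corrects. Without interleaving, the same cluster lands in one word as an MBU, and the single-error-correcting decoder returns wrong data.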

Pseudo hard-errors cannot be recovered by re-writing but can be recovered by resetting FFs or by power cycling.

Hard/permanent errors cannot be recovered by any software and may cause fatal failures. Replacement or isolation of the corrupted parts is the only possible way to continue using the system.

### 2.3 Failure modes

A failure is defined as an observable faulty condition in an electronic system that requires action to resolve. Faults and errors can sometimes be masked without any countermeasures. To establish a solution, the root cause or physical mechanism must be identified. Classification of failures is often applied to identify the root cause or the root parts/chips on a system board [21,22].

Table 3 shows an example of such a classification, based on the fatality of the failure, with two key factors: latency in operation and duration of recovery. When SDCs take place in a large-scale supercomputer, a simulation may give wrong results without any latency. This type of failure is called SLFL (SiLent FaiLure). If SDCs impair the convergence of matrix calculations, or if error detection triggers frequent rollbacks, significant time loss may occur in the computer system. We call this LTFL (LaTency FaiLure). If the system requires a short outage to recover, we call the failure LHFL (Light Halt FaiLure). If the system requires a long outage, we call the failure HHFL (Heavy Halt FaiLure). If the system is unrecoverable, we call the failure FTFL (FaTal FaiLure).
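The classification described above can be expressed as a small decision function. The following Python sketch is only one possible reading: the input flags, the order of the checks and in particular the 60-second boundary between a short and a long outage (`light_halt_limit_s`) are hypothetical, since no numerical threshold is given here.

```python
from enum import Enum


class FailureMode(Enum):
    SLFL = "silent failure"
    LTFL = "latency failure"
    LHFL = "light halt failure"
    HHFL = "heavy halt failure"
    FTFL = "fatal failure"


def classify_failure(detected: bool, recoverable: bool, outage_s: float,
                     light_halt_limit_s: float = 60.0) -> FailureMode:
    """Map an observed failure onto the five classes.

    detected:           the system noticed the error (False means silent corruption)
    recoverable:        the system can return to service without replacing parts
    outage_s:           service outage needed for recovery, in seconds
    light_halt_limit_s: assumed boundary between a short and a long outage
    """
    if not recoverable:
        return FailureMode.FTFL
    if not detected:
        return FailureMode.SLFL
    if outage_s == 0:
        # Detected and handled on the fly (e.g. by rollback): only time is lost.
        return FailureMode.LTFL
    if outage_s <= light_halt_limit_s:
        return FailureMode.LHFL
    return FailureMode.HHFL


print(classify_failure(detected=True, recoverable=True, outage_s=5.0).name)
```

Making the outage limit a parameter reflects that the boundary between a light and a heavy halt depends on the application.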

| Class | Definition | Mode name | Characteristics | In-situ detection | In-situ recovery method |
|-------|------------|-----------|-----------------|-------------------|-------------------------|
| Non-latency failure | Silent data corruption in data or address that causes wrong simulation results on a supercomputer | SLFL<sup>1</sup> | | None | Checkpointing + rollback (if fault-level detection works) |
| Latency failure | Performance of the electronic system is lowered due to over-frequent rollback, for example. | LTFL<sup>2</sup> | This mode commonly takes place when time redundancy techniques are applied, typically rollback after fault detection by a double module redundancy technique. | DMR<sup>7</sup> error flag, number of retries | Reboot |
| Light halt failure | Electronic system can be recovered by a short-time operation. | LHFL<sup>3</sup> | MBU, MNT and SEL can be the cause. | ECC, current/potential monitor | Reboot/power cycling |
| Heavy halt failure | Electronic system can be recovered by a long-time operation. Some logs and data may be lost. | HHFL<sup>4</sup> | Errors in the configuration memory of an FPGA, or SEL, can cause this mode. | CRC check | Partial reconfiguration/power cycling |
| Unrecoverable (Fatal) failure | Destruction of the power supply and/or power devices. The power supply or the entire electronic system may have to be replaced. | FTFL<sup>5</sup> | Destruction of IGBTs or DC-DC converters due to SEB. | System down | None |

Table 3 Example for classification of failure modes

1: Silent Failure, 2: Latency Failure, 3: Light Halt Failure, 4: Heavy Halt Failure, 5: Fatal Failure, 6: Silent Data Corruption, 7: Double Module Redundancy

## III. VISUALIZATION AND MITIGATION OF SEE

In order to establish overall mitigation techniques in electronic systems, four key technologies need to be integrated: prediction, detection, prevention and in-situ/off-line recovery. Table 4 summarizes such techniques along two axes: the stack layers and the four key technologies.

|                |                                                                                                                    |                                                                                                                      | U                                                                                     |                                                                                                                            |                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |                                                                               |
|----------------|--------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
| Layer          | Prediction/Estimation                                                                                              | Prevention                                                                                                           | In-situ detection                                                                     | In-situ recovery                                                                                                           | Off-line recovery | [r1] A. Evans, et al.(2012)<br>[r2] K. Shimbo, et al.(2011)<br>[r3] P. Roche(2010)<br>[r4] T. Takata, et al.(2010)<br>[r5] T. Talkata, et al.(2011)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |                                                                               |
| Application    | Simulation based fault<br>injection[r1]                                                                            | Probabilitistic<br>calculation[r9]                                                                                   | Anormally operation (e.g. SWAT[r18])                                                  | Checkpointing-<br>Rollback[r29]                                                                                            |                   | [r6] E. lbe, et al. (2001)<br>[r7] H. Yamaguchi, et al.()                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |                                                                               |
| OS             | •Log analysis                                                                                                      |                                                                                                                      | detection mechanism in<br>the kernel[r19]                                             |                                                                                                                            | •Reboot           | [r9] E. Ibe, et al(2006)<br>[r9] R. Kumar(2011)<br>[r10] E.Ibe, et al.(2011)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | [r9] R. Kumar(2011)<br>[r10] E.lbe, et al.(2011)<br>[r10] E.lbe, et al.(2011) |
| PCB | •Full/partial board irradiation<br>[r2] | •DOUB (Design On Upper<br>Bound)[r10] | •Watch-dog timer[r20] | •LABIR (inter-LAyer<br>Built-In Reliability)[r30]<br>•Cross-Layer Reliability<br>[r31] | •Reboot [r35] | [r11] T. Calin, et al.(1996)<br>[r12] N. Seifert, et al.(2008)<br>[r13] S. Mitra, et al.(2007)<br>[r14] T. Uemura, et al.(2010)<br>[r15] H.-H. Lee, et al.(2010) | |
| Chip/Processor | •Simulation based fault<br>injection<br>•Emulation based fault<br>injection[r3]<br>•Irradiation test<br>•Log analysis | | •DMR (Double Module<br>Redundancy)[r21]<br>•On-chip monitor<br>[r22,r23]<br>•CRC[r24] | •TMR (Triple Module<br>Redundancy)[r32]<br>•Checkpointing-Rollback<br>•Partial reconfiguration<br>[r33] | •Reboot | [r16] J. Furuta, et al.(2010)<br>[r17] D. Ernst, et al.(2003)<br>[r18] M. Li, et al.(2003)<br>[r19] A. Pellegrini, et al.(2011)<br>[r20] P.C. Monferrer, et al.(2007)<br>[r21] K. Noguchi, et al.(2017)<br>[r22] K. Noguchi, et al.(2007)<br>[r23] K. Yoshikawa, et al.(2011)<br>[r24] S.-J. Wen, et al.(2008)<br>[r25] T. Uemura, private communication (2012)<br>[r26] A. Sanyal, et al.(2010)<br>[r27] T. Wang, et al.(2010)<br>[r28] S.A. Bota, et al.(2010)<br>[r29] D. Skalin, et al.(2009)<br>[r30] E. Ibe, et al.(2011)<br>[r31] J. Loncaric(2011)<br>[r32] H. Quinn, et al.(2007)<br>[r33] M. Abdelfattah (2012)<br>[r34] K.Z. Pekmestzi, et al.(2008)<br>[r35] K. Shimbo, et al.(2011) | |
| Circuit | •Circuit simulation<br>•Logic masking simulation [r4,r5]<br>•Irradiation test | •Space/Time redundancy<br>(DICE[r11], SEUT[r12],<br>BISER[r13], SEILA[r14],<br>LEAP[r15]), BCDMR[r16],<br>RAZOR[r17] | •Parity for FFs[r25]<br>•BIST (Built-In Self Test) [r26] | •BISR (Built-In Self<br>Repair)[r34] | | | |
| Device | •SEE Monte-Carlo<br>simulation [r6]<br>•TCAD Simulation[r7]<br>•Irradiation test | •Addition of resistor and/or<br>capacitor<br>•Confinement of charge<br>collection volume<br>•Gate sizing | •ECC, parity | •ECC (SBU only)<br>•Data mirroring | | | |
| Substrate/well | •TCAD Simulation[r8] | •Enhancement of migration<br>•Optimization of well<br>structure/size | •BICS (Built-In Current<br>Sensor)[r27,r28]<br>•BIPS (Built-In Pulse<br>Sensor) | | | | |

Table 4 Visualization and mitigation techniques of SEE

Explanations of the following columns of Table 4 are omitted here because of space limitations.

3.1 Prediction/estimation techniques

3.2 Prevention Techniques

3.3 In-situ detection

3.4 In-situ recovery

#### 3.5 DOUB (Design On Upper Bound) and LABIR (inter-LAyer Built-In Reliability)

Sections 3.2 and 3.4 review prevention and recovery techniques at each stack layer (device, circuit, module/processor). No single mitigation technique appears to satisfy the reliability and performance requirements simultaneously with minimum penalties and at reasonable cost.

The authors are therefore working on a different, novel approach named Design on Upper Bound (DOUB), by which the upper-bound failure rate can be estimated explicitly.
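The screening idea behind DOUB can be illustrated with a short sketch. It is not the authors' tool and rests on assumptions that do not come from this paper: rates are expressed in FIT, each physical limit is modeled as a multiplicative factor between 0 and 1 applied to an unmasked (maximum) upper bound, and the names `UpperBoundEstimate`, `screen`, and `target_fit`, as well as all numbers, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class UpperBoundEstimate:
    """Upper-bound fault rate of one chip (hypothetical model, rates in FIT)."""
    name: str
    max_rate_fit: float                          # unmasked bound: no masking effects
    limits: dict = field(default_factory=dict)   # limit name -> factor in (0, 1]

    def realistic_rate_fit(self) -> float:
        """Tighten the unmasked bound by every physical-limit factor."""
        rate = self.max_rate_fit
        for label, factor in self.limits.items():
            if not 0.0 < factor <= 1.0:
                raise ValueError(f"{label}: factor must be in (0, 1]")
            rate *= factor
        return rate


def screen(chips, target_fit):
    """Split chips into those that can be ignored and those needing mitigation."""
    ignore, mitigate = [], []
    for chip in chips:
        if chip.realistic_rate_fit() <= target_fit:
            ignore.append(chip.name)
        else:
            mitigate.append(chip.name)
    return ignore, mitigate


# Made-up example values, for illustration only.
chips = [
    UpperBoundEstimate("chip_a", 1000.0, {"layout": 0.5, "logic_depth": 0.25}),
    UpperBoundEstimate("chip_b", 1000.0, {"layout": 0.5}),
]
ignore, mitigate = screen(chips, target_fit=200.0)
print(ignore, mitigate)
```

With these made-up numbers, `chip_a` is bounded at 125 FIT and drops out of further analysis, while `chip_b` at 500 FIT remains a candidate for mitigation. Because every factor is at most 1, the tightened value never exceeds the unmasked bound.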

Equation (A), for example, gives the maximum upper bound of the chip-level SET rate because it does not include any masking effects. By tightening this maximum upper bound with the physical limits determined by device structure and layout, circuit complexity, and the structure of the logical layers, a realistic upper-bound failure rate that is free from the variations may be obtained. Using the soft-error Monte-Carlo simulator CORIMS, the authors have also calculated upper-bound fault rates for various terrestrial radiation sources. Figure 7 shows an example of such cumulative upper bounds, calculated from the spectra in Fig. 2.

If the upper bound for a chip is low enough, the chip can be excluded from further analysis. If the upper bound is of concern, mitigation techniques are applied in the design phase, starting from simple and low-cost methods such as:

(i) Replacement of weak logic gates and memories with robust ones. DRAM is currently a very robust device and can substitute for SRAM where speed is not critical [33,34].

(ii) Minimization of the active memory area.

(iii) Limited and local use of space or time redundancy techniques at the circuit level, and so on.

The authors are also proposing a novel method, LABIR (inter-LAyer Built-In Reliability), as illustrated in Fig. 8. LABIR proposes interactive or communicative mitigation techniques in which a recovery action, such as rollback to a checkpoint, is triggered when a layer finds any error symptom, not necessarily an error or fault itself. BIST (Built-In Self Test) [47] and built-in current or pulse sensors (BICS [48], BIPS) can be used for this kind of symptom detection. Because a symptom may appear only infrequently, power and area penalties can be minimized with minimal additional structures and circuits. With BIPS, a current pulse propagating from an MCBI (Multi-Coupled Bipolar Interaction) zone in a p-well can be detected on the $I_{dd}$ line, as demonstrated in [14]. By capturing such a symptom, for example with a sense amplifier placed between two adjacent p-wells, errors or failures can be recovered by a rollback and replication operation at the CPU level of the ULSI chip. Other noise sources such as EMI (Electro-Magnetic Interference) [35] propagate over a wider area than a soft error, spanning many wells, so they can be eliminated by taking the difference between adjacent wells.
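The combination of differential sensing between adjacent wells and symptom-triggered rollback can be sketched in software as follows. This is only an illustration under stated assumptions, not the authors' circuit: well signals are plain numbers in arbitrary units, the detection threshold is invented, and `symptom_wells`, `CheckpointedTask`, and `run_step` are hypothetical names.

```python
import copy


def symptom_wells(well_signals, threshold):
    """Return indices of wells that stand out from an adjacent well.

    Differencing neighbours cancels a disturbance common to many wells
    (such as EMI) and keeps a pulse that is localized to one well.
    """
    flagged = set()
    for i in range(len(well_signals) - 1):
        diff = well_signals[i] - well_signals[i + 1]
        if abs(diff) > threshold:
            flagged.add(i if diff > 0 else i + 1)   # flag the larger of the pair
    return sorted(flagged)


class CheckpointedTask:
    """Holds working state plus the last committed checkpoint."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def commit(self):
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)


def run_step(task, update, well_signals, threshold):
    """Apply one update; roll back on a symptom, otherwise commit.

    Returns True if the step was committed, False if it was rolled back.
    """
    update(task.state)
    if symptom_wells(well_signals, threshold):
        task.rollback()
        return False
    task.commit()
    return True


def bump(state):
    state["counter"] += 1


task = CheckpointedTask({"counter": 0})
emi = [5.0, 5.0, 5.0, 5.0]      # same disturbance on every well (hypothetical)
pulse = [0.0, 0.0, 3.0, 0.0]    # localized pulse on well 2 (hypothetical)

committed_emi = run_step(task, bump, emi, threshold=1.0)
committed_pulse = run_step(task, bump, pulse, threshold=1.0)
print(committed_emi, committed_pulse, task.state)
```

In this toy run the uniform disturbance produces no adjacent-well difference and the step is committed, whereas the localized pulse triggers a rollback to the last checkpoint. Re-executing the rolled-back step is left to the caller.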

#### CONCLUSIONS

As semiconductor device scaling proceeds far below the 100 nm design rule, terrestrial neutron-induced soft errors, typically in SRAMs, are predicted to worsen further.

Moreover, novel failure modes that may be more serious than memory soft errors have recently been reported. The need to implement mitigation techniques with marginal penalties, including power dissipation, at the design phase is therefore growing rapidly, together with the need to develop advanced detection and quantification techniques. The most advanced of these techniques are reviewed and discussed, and novel mitigation strategies, Design on Upper Bound (DOUB) and inter-LAyer Built-In Reliability (LABIR), are proposed.



Fig. 7 Cumulative upper-bound fault rates due to various radiation sources at NYC sea level calculated by using CORIMS



Fig. 8 General design flow of stepwise reduction in SER under the design on upper bound concept. Power consumption, cost, and global warming are key issues.

#### REFERENCES

- [1] N. Sugii, R. Tsuchiya, T. Ishigaki, Y. Morita, H. Yoshimoto, K. Torii, and S. Kimura, *IEDM, San Francisco, Dec. 15-17*, pp. 249-253 (2008).
- [2] S. Wen, R. Wong, and A. Silburt, SELSE4, University of Texas at Austin, March 26-27 (2008).
- [3] D. Villanueva, A. Pouydebasque, E. Robilliart, T. Skotnicki, E. Fuchs, and H. Jaoue, 2003 IEDM, Washington, DC, December 7 - 10, 2003, No.9.4 (2003).
- [4] L.T. Clark, K.C. Moh, K.E. Holbert, X. Yao, J. Knudsen, and H. Shah, TNS, Vol.54, No.6, pp. 2028-2036 (2007).
- [5] S. Hamdioui (Organizer), Special Session 4-Panel: "Reliability of Hard Real-time Systems in 32nm and Beyond: Who Will Solve the Challenges?," 2012 IEEE Int'l On-Line Testing Symposium (2012).
- [6] C. Slayman, 2003IRPS, SER Panel Discussion, Dallas, Texas, April 2, 2003, No.6 (2003).
- [7] H. Quinn, SELSE7, Champaign, Illinois, March 29-30 (2011).
- [8] E. Ibe, K. Shimbo, T. Toba, H. Taniguchi, and Y. Taniguchi, "LABIR: Inter-LAyer Built-In Reliability for Electronic Components and Systems," SELSE7, Champaign, Illinois, March 29-30 (2011).
- [9] A. Evans, M.Nicolaidis, S.-J. Wen, D. Alexandrescu, and E. Costenaro, *IOLTS 2012, Sitges, Spain, June 27-29, 2012*, No.6.2 (2012).

- [10] S. Wen, R. Wong, M. Romain, and N. Tam, IRPS 2010, Anaheim, CA, May 2-6, 2010, No.SE5.1, pp. 1036-1039 (2010).
- [11] B.D. Sierawski, R.A. Reed, R.D. Schrimpf, R.A. Weller, M.H. Mendenhall, M.A. Xapsos, R.C. Baumann, and X. Deng, NSREC, Quebec, Canada, July 20-24, 2009, No.A-8 (2009).
- [12] B.D. Sierawski, M.H. Mendenhall, R.A. Reed, M.A. Clemens, R.A. Weller, R.D. Schrimpf, E.W. Blackmore, M. Trinczek, B. Hitti, J.A. Pellish, R.C. Baumann, S.-J. Wen, R. Wong, N. Tam, *IEEE Trans. Nucl. Sci.*, Vol.57, No.6, pp. 3273-3278 (2010).

- [13] E. Ibe, T. Toba, K. Shimbo, and H. Taniguchi, IOLTS 2012, Sitges, Spain, June 27-29, 2012, No.3.2 (2012).
- [14] E. Ibe, S. Chung, S. Wen, H. Yamaguchi, Y. Yahagi, H. Kameyama, S. Yamamoto, and T. Akioka, 2006 CICC, San Jose, CA., September 10 - 13, 2006, pp. 437-444 (2006).
- [15] E. Ibe, S. Chung, S. Wen, Y. Yahagi, H. Kameyama, and S. Yamamoto, 2006 NSREC, Ponte Vedra Beach, Florida, July 17-21, 2006, No.PC-6 (2006).
- [16] E. Ibe, S. Chung, S. Wen, Y. Yahagi, H. Kameyama, S. Yamamoto, T. Akioka, and H. Yamaguchi, Workshop on Radiation Effects on Component and Systems (RADECS), Athens, Greece, September 27-29, 2006, No.D-2 (2006).
- [17] D. Radaelli, H. Puchner, P. Chia, S. Wong, and S. Daniel, 2005 NSREC, Seattle, Washington, July 11-15, 2005, No.F-4 (2005).
- [18] N. Seifert and V. Zia, SELSE3, Austin, Texas, April 3-4 (2007).
- [19] T. Nakamura, M. Baba, E. Ibe, Y. Yahagi, and H. Kameyama, "Terrestrial Neutron-Induced Soft-Errors in Advanced Memory Devices," New Jersey, World Scientific (2008).
- [20] K. Pagiamtzis, N. Azizi, and F. Najm, 2006 CICC, San Jose, CA., September 10-13, 2006, pp. 301-304 (2006).
- [21] C. Lopez-Ongil, M. Portela-Garcia, M.G. Valderas, A. Vaskova, L. Entrena, J. Rivas-Abalo, A. Martin-Ortega, J.M. Oter, S. Rodriguez-Bustabad, and I. Arruego, *IOLTS 2012, Sitges, Spain, June 27-29, 2012*, No.9.4, pp. 188-193 (2012).
- [22] R. Baranowski, and H.-J. Wunderlich, IOLTS2011, Athens, Greece, July 13-15, 2011, No.13.3, pp. 278-283(2011).
- [23] P. Roche, IOLTS2010, Corfu Island, Greece, July 5-7, Keynote 1, p. xv (2010).
- [24] T. Takata and Y. Matsunaga, SELSE 2011, Champaign, Illinois, March 29-30 (2011).
- [25] T. Makino, D. Kobayashi, K. Hirose, D. Takahashi, S. Ishii, M. Kusano, S. Onoda, T. Hirao, and T. Ohshima, *TNS2009*, Vol.56, No.6, pp. 3180-3184 (2009).
- [26] H. Nakamura, K. Tanaka, T. Uemura, K. Takeuchi, T. Fukuda, and S. Kumashiro, *IRPS, Anaheim, CA, May 2-6*, pp. 694-697 (2010).
- [27] E.H. Cannon and M. Cabanas-Holmen, NSREC, Quebec, Canada, July 20-24, 2009, No.PI-3 (2009).
- [28] JESD89A, JEDEC Standard, JEDEC Solid State Technology Association, No.89, pp. 1-85 (2006).
- [29] E. Ibe, K. Shimbo, T. Toba, Y. Taniguchi, and H. Taniguchi, ICICDT2010, Grenoble, France, June, SER Session No.1 (2010).
- [30] E. Ibe, "Novel Features in SER Characteristics toward New Standards Special Session 1-Panel :SER standards: Where we are? What's next?," *IOLTS2010, Corfu Island, Greece, July 5-7* (2010).
- [31] D. Alexandrescu, R. Baumann, A. Bougerol, E. Ibe, S. Rezgui, and C. Slayman, "Special Session 1-Panel: SER standards: Where are we? What's next?," *IOLTS2010, July5-7, 2010, Corfu Island, Greece* (2010).
- [32] D.F. Heidel, K.P. Rodbell, P.W. Marshall, J.A. Pellish, K.A. LaBel, M.A. Xapsos, S.E. Hakey, M.C. Rauch, J.R. Schwank, P.E. Dodd, M.R. Shaneyfelt, M.D. Berg, M.R. Friendlich, A.D. Phan, and C.M. Seidleck, *NSREC, Quebec, Canada, July 20-24, 2009*, No.I-9 (2009).
- [33] K. Shimbo, T. Toba, IEICE Technical report CPM2009-139, *Kochi, Dec. 2-4*, Vol.109, No.317,318, pp. 51-55 (2009) (in Japanese).
- [34] K. Shimbo, T. Toba, K. Nishii, E. Ibe, Y. Taniguchi, and Y. Yahagi, SELSE7, Champaign, Illinois, March 29-30 (2011).
- [35] N. Kanekawa, E. Ibe, T. Suga, and Y. Uematsu, "Dependability in Electronic Systems-Mitigation of Hardware Failures, Soft Errors, and Electro-Magnetic Disturbances-," New York, Springer (2010).
- [36] L. Borucki, G. Schindlbeck, and C. Slayman, *IRPS 2008, Phoenix, Arizona, April 27-May 1, Phoenix Convention Center,* No.5A.4 (2008).
- [37] H. Ando and S. Hatanaka, IEEE Workshop on Silicon Errors in Logic System Effects 3, Austin, Texas, April 3-4 (2007).
- [38] M. Yoshimura, Y. Akamine, and Y. Matsunaga, SELSE 2011, Champaign, Illinois, March 29-30 (2011).
- [39] T. Calin, M. Nicolaidis, and R. Velazco, TNS, Vol.43, No.6, pp. 2874-2878 (1996).
- [40] M. Cabanas-Holmen, E. H. Cannon, A. Kleinosowski, J. Ballast, J. Killens, and J. Socha, TNS, Vol.56, No.6, pp. 3505-3510 (2009).
- [41] H.-H. K. Lee, K. Lilja, M. Bounasser, P. Relangi, I.R. Linscott, U.S. Inan, and S. Mitra, idem., pp. 203-212 (2010).
- [42] S. Mitra, M. Zhang, N. Seifert, T. Mak, and K.S. Kim, *ICICDT2007, Austin, Texas, May 18-20, 2007, pp. 263-268* (2007).
- [43] J. Furuta, K. Kobayashi, and H. Onodera, IEICE Trans. on Electronics, Vol.E93-C, No.3, pp. 340-346 (2010).
- [44] J. Furuta, C. Hamanaka, K. Kobayashi, and H. Onodera, VLSIC, Honolulu, HI, USA, June 16-18, pp. 123-124 (2010).
- [45] D. Ernst, N.S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge, MICRO-36 (2003).
- [46] H. Quinn, K. Morgan, P. Graham, J. Krone, M. Caffrey, and K. Lundgren, NSREC, Honolulu, Hawaii, July 23-27, 2007, No.C-5 (2007).
- [47] G. Theodorou, N. Kranitis, A. Paschalis, and D. Gizopoulos, *IOLTS2010, Corfu Island, Greece, July 5-7, 2010, No.7.4*, pp. 159-164 (2010).
- [48] S.A. Bota, G. Torrens, B. Alorda, J. Verd, and J. Segura, *IOLTS2010, Corfu Island, Greece, July 5-7, 2010*, No.7.1, pp. 141-146 (2010).

Proceedings of 10th RASEDA, Tsukuba, Japan (2012).