A lot has been written about Moore's Law and the potential limitations of semiconductor design and manufacturing in the future. Can we really continue to double the number of transistors on a given-sized die every eighteen months? For the last 30 years, Moore's Law has held, but we may now be seeing the real limitation to future semiconductor development and to Moore's Law: severely shortened operational lifetimes of advanced chips. That, in turn, creates massive reliability problems for critical embedded systems in the future.
As we advanced our semiconductor development knowledge and manufacturing processes over the years, the reliability and operational lifetimes of those chips increased to such a level that all semiconductor makers dropped their 883B testing procedures and eliminated their military chip lines and designations. The chips coming off the production line were "good enough" for many critical systems applications, and the military market was too small to justify separate testing and product designations. While the commercial chips were typically rated for 0-50 degrees C operation, some critical applications needed reliable operation over extended temperature ranges (-40 to +125 degrees C). Those system builders took the commercial-grade devices, sent them out to testing labs, and found that a large majority of them would, in fact, operate reliably at those extended temperature ranges.
We did run into some problems in the shift from micron to sub-micron geometries. We could operate those small-geometry chips at much lower voltages, but they consumed much more current and produced more heat, requiring better cooling techniques. Leakage currents also became a severe problem, adding to the high current requirements and cooling issues. We all know that heat kills electronic components. A common rule of thumb derived from the Arrhenius equation says that for every 10 degrees C you lower the temperature of your electronics, your MTBF (Mean Time Between Failures) roughly doubles. We typically used air to cool these devices. Air is a great insulator, which makes it a terrible conductor of heat. With forced air, you can remove about 1 to 1.5 watts of heat per square inch of PCB area. Sub-micron chips, particularly microprocessors, generate 25 watts of heat per square inch or more.
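The 10-degree rule of thumb can be checked against the Arrhenius acceleration factor. The sketch below is illustrative only: the activation energy of 0.7 eV and the 85 to 75 degrees C operating points are assumptions chosen for the example, not values from this article, and the function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def arrhenius_life_gain(t_cool_c, t_hot_c, activation_energy_ev=0.7):
    """Factor by which expected life grows when a part runs at t_cool_c
    instead of t_hot_c, using the Arrhenius model.

    activation_energy_ev is an ASSUMED value; real failure mechanisms
    each have their own activation energy.
    """
    t_cool_k = t_cool_c + 273.15
    t_hot_k = t_hot_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_cool_k - 1.0 / t_hot_k
    )
    return math.exp(exponent)


if __name__ == "__main__":
    gain = arrhenius_life_gain(t_cool_c=75.0, t_hot_c=85.0)
    print(f"Life gain for a 10 C drop (85 C -> 75 C): {gain:.2f}x")
```

Under these assumptions the gain comes out at about 1.9x, close to the "doubling" rule. A different activation energy or temperature range gives a different factor, which is why the rule is only an approximation.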
To maintain extended operational lifetimes of chips and increase reliability in critical systems, we had to develop new cooling techniques. At VITA, we created and standardized many new high-efficiency cooling schemes over the past 5 years for the sub-micron chips we were using. Liquid flow-through, cold walls, cold plates, direct spray cooling (spraying atomized coolant directly onto the chips), and hybrid conduction-liquid cooling methods are documented to remove up to 100 watts per square inch or more from these hot chips. These new cooling techniques pushed the MTBF out to 15-17 years again and overcame the shorter operational lifetimes that the new sub-micron chips of the time were exhibiting.
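To put the heat-flux figures quoted above side by side, here is a small arithmetic sketch. It uses only the numbers from the text (1 to 1.5 W per square inch for forced air, 25 W per square inch for a hot chip, 100 W per square inch for the newer cooling methods); the constant and function names are hypothetical, and the comparison deliberately ignores heat spreading and other real-world effects.

```python
# Heat-flux figures quoted in the text, in watts per square inch.
FORCED_AIR_W_PER_IN2 = (1.0, 1.5)
CHIP_HEAT_W_PER_IN2 = 25.0
ADVANCED_COOLING_W_PER_IN2 = 100.0


def coverage(capacity_w_per_in2, load_w_per_in2):
    """Fraction of the heat load that a cooling method can remove."""
    return capacity_w_per_in2 / load_w_per_in2


if __name__ == "__main__":
    for capacity in FORCED_AIR_W_PER_IN2:
        share = coverage(capacity, CHIP_HEAT_W_PER_IN2)
        print(f"Forced air at {capacity} W/in^2 covers {share:.0%} of the load")
    margin = coverage(ADVANCED_COOLING_W_PER_IN2, CHIP_HEAT_W_PER_IN2)
    print(f"Advanced cooling offers {margin:.0f}x the required capacity")
```

On these figures, forced air handles only a few percent of what a 25 W per square inch device produces, while the liquid and spray methods leave headroom.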
We also saw SEUs (Single Event Upsets) in sub-micron devices, caused by ions or electromagnetic radiation striking the tiny elements of the chips and changing their state. In some instances, the ionization caused permanent damage to the device, especially in microprocessors, memories, and large power transistors. For many applications, especially in aerospace, such failures are unacceptable. New shielding methods, redundant systems, and radiation-hardening of chips used in aerospace resolved these problems of maintaining acceptable operational lifetimes and the reliability of the ICs.
Today, we are designing and manufacturing semiconductor devices with nm (nanometer) feature geometries. That's three orders of magnitude smaller than micron-based designs. And a new set of problems has arisen with these chips that causes concern. Previous semiconductor generations were showing operational lifetimes of 10-15 years or more. Most CRT-based, semiconductor-driven TV sets operated for 14-16 years, although they were in air-conditioned homes. They were exposed to thousands of thermal cycles, voltage and current spikes on the power lines, and other environmental torture. However, empirical evidence is now showing that nm-geometry ICs are wearing out in just 3-5 years.
A recent paper, "State of the Art Semiconductor Devices in Future Aerospace Systems"*, explores these disconcerting phenomena and identifies four basic failure modes that cause these nm-geometry chips to have very short operational lifetimes:
- TDDB (Time Dependent Dielectric Breakdown): The dielectric fails due to the voltage stress in the gate oxide as CMOS device feature sizes decrease.
- EM (Electromigration): The migration of the metal atoms in a conductor due to momentum exchange between the conducting electrons and the metal atoms in the conductor.
- HCI (Hot Carrier Injection): The breakdown of the silicon barriers as channel lengths get shorter inside small-geometry devices.
- NBTI (Negative Bias Temperature Instability): Silicon dioxide-to-substrate compromise at elevated temperatures.
We have replaced the old aluminum conductor layers with copper in ICs to mitigate the migration issues. But the dielectric material is clearly a major part of the problem noted above. As the silicon dioxide gate dielectric goes below 2nm in thickness, leakage currents due to tunneling increase dramatically. That raises power consumption, increases heat generation, and reduces the reliability and operational life of the IC. Replacing silicon dioxide with high-K materials (materials with a higher dielectric constant, such as hafnium- and zirconium-based compounds) may resolve some of these failure modes, but for how long and at what cost? As we continue to shrink the feature sizes of semiconductor elements (90nm, 65nm, 45nm, 22nm, etc.), does the high-K dielectric hold up over higher temperatures?
In 2006, for the first time in history, over 50% of all semiconductors shipped worldwide went into consumer electronics products. It is clear that the semiconductor makers are all focused on consumer product markets, which may accept shorter operational lifetimes. After all, cell phone users have accepted very poor quality of service, dropped calls, dead zones, and other maladies that never occurred with landlines. Also, the makers of cell phones, MP3 players, DVD players, digital cameras, PCs, etc. inspire the consumer to buy newer devices every 3-5 years to get advanced features. Is there a better way to inspire replacement of those devices than to have them just stop working in 2-3 years? I'm not suggesting a conspiracy theory here, but I don't see a lot of incentive for semi makers to resolve these reliability and shorter operational lifetime problems. That doesn't bode well for critical application segments like aerospace and military semiconductor users.
Additionally, we have seen that the old methods of predicting reliability in electronics have begun to fail us. MIL-HDBK-217 has been the cornerstone of reliability prediction for many years. That document was created decades ago, and has been updated a few times in the past 10 years (the ANSI/VITA 51.1-2008 specification provides a standardization of the inputs to MIL-HDBK-217F Notice 2 calculations to give more consistent results). But MIL-HDBK-217 is rapidly becoming less relevant and less reliable as we venture into this new realm of nm-sized semiconductor features and their failure modes. Within VITA, as well as in some other organizations, a new reliability prediction model is taking shape: PoF (Physics of Failure). PoF analysis is a methodology for identifying and characterizing the physical processes and mechanisms that cause failures in electronic components. Computer models integrating deterministic formulas from physics and chemistry are the foundation of PoF and offer a completely new methodology for reliability prediction.
So, as we continue to shrink semiconductor features to the nanometer level, the inherent limit to Moore's Law may actually be the shorter operational lifetimes and decreased reliability of the devices, not the ability to create smaller and smaller features. The last thing we need today is an expensive semiconductor device that works for a short time, and then behaves like a fuse. At the 45nm level, the thinnest layers on the die are approaching a thickness of only 3 to 5 atoms. If an atom of the material we are using (which is approximately 2 Angstroms in diameter, or about two tenths of a nanometer) is out of place, it can potentially cause the device to fail. Either the materials engineers will figure this out, make it work reliably, and solve these problems, or we may finally be approaching the juncture where we have to abandon electrons, and start using photons for computing.
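The atom-count figures above translate into layer thicknesses with simple arithmetic. The sketch below uses only the numbers quoted in the text (layers of 3 to 5 atoms, atoms roughly 0.2 nm across); it treats the layer as a simple stack of atoms, which is a simplifying assumption, and the function names are hypothetical.

```python
ATOM_DIAMETER_NM = 0.2  # approximate atomic diameter quoted in the text


def layer_thickness_nm(atoms_thick):
    """Approximate layer thickness, assuming atoms stack end to end."""
    return atoms_thick * ATOM_DIAMETER_NM


def single_atom_share(atoms_thick):
    """Fraction of the layer thickness represented by one atom."""
    return 1.0 / atoms_thick


if __name__ == "__main__":
    for atoms in (3, 5):
        thickness = layer_thickness_nm(atoms)
        share = single_atom_share(atoms)
        print(f"{atoms} atoms: about {thickness:.1f} nm thick, "
              f"one atom is {share:.0%} of the layer")
```

On this rough model the thinnest layers are about 0.6 to 1.0 nm thick, so a single misplaced atom accounts for a fifth to a third of the layer, which illustrates why one defect can matter so much.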