Last week, Paul Alcorn over at Tom’s Hardware picked up on an interesting statement made by Intel in their Q4 2016 earnings call. The company, whose Data Center group’s profits had slipped a bit year-over-year, was “observing a product quality issue in the fourth quarter with slightly higher expected failure rates under certain use and time constraints.” As a result the company had setup a reserve fund as part of their larger effort to deal with the issue, which would include a “minor” design (i.e. silicon) fix to permanently resolve the problem.
A bit more digging by Paul further turned up that the problem was with Intel’s Atom C2000 family, better known by the codenames Avoton and Rangeley. As a refresher, the Silvermont-based server SoCs were launched in Q3 of 2013 – about three and a half years ago – and are offered with 2, 4, and 8 cores. These chips are, in turn, meant for use in lower-power and reasonably highly threaded applications such as microservers, communication/networking gear, and storage. As a result the C2000 is an important part of Intel’s product lineup – especially as it directly competes with various ARM-based processors in many of its markets – but it’s a name that’s better known to device manufacturers and IT engineers than it is to consumers. Consequently, an issue with the C2000 family doesn’t immediately raise any eyebrows.
Jumping a week into the present, since their earnings call Intel has posted an updated spec sheet for the Atom C2000 family. More importantly, device manufacturers have started posting new product errata notices; and while they are keeping their distance away from naming the C2000 directly, all signs point to the affected products being C2000 based. As a result we finally have some insight into what the issue is with C2000. And while the news isn’t anywhere close to dire, it’s certainly not good news for Intel. As it turns out, there’s a degradation issue with at least some (if not all) parts in the Atom C2000 family, which over time can cause chips to fail only a few years into their lifetimes.
The Problem: Early Circuit Degradation
To understand what’s going on and why C2000 SoCs can fail early, let’s start with Intel’s updated spec sheet, which contains the new errata for the problem.
AVR54. System May Experience Inability to Boot or May Cease Operation
Problem: The SoC LPC_CLKOUT0 and/or LPC_CLKOUT1 signals (Low Pin Count bus clock outputs) may stop functioning.
Implication: If the LPC clock(s) stop functioning the system will no longer be able to boot.
Workaround: A platform level change has been identified and may be implemented as a workaround for this erratum.
At a high-level, the problem is that the operating clock for the Low Pin Count bus can stop working. Essentially a type of legacy bus, the LPC bus is a simple bus for simple peripherals, best known for supporting legacy devices such as serial and parallel ports. It is not a bus that’s strictly necessary for the operation of a computer or embedded device, and instead its importance depends on what devices are being hung off of it. Along with legacy I/O devices, the second most common device type to hang off of the LPC is the boot ROM/BIOS– owing to the fact that it’s a simple device that needs little bandwidth – and this is where the C2000 flaw truly rears its head.
As Intel’s errata succinctly explains, if the LPC bus breaks, then any system using it to host the boot ROM will no longer be able boot, as the system would no longer be able to access said boot ROM. The good news is that Intel has a workaround (more on that in a second), so it’s an avoidable failure, but it’s a hardware workaround, meaning the affected boards have to be reworked to fix them. Complicating matters, since Atom C2000 is a BGA chip being used in an embedded fashion, an LPC failure means that the entire board (if not the entire device) has to be replaced.
Diving deeper, the big question of course is how the LPC bus could break in this fashion. To that end, The Register reached out to Intel and has been able to get a few more details. As quoted by The Register, Intel is saying that the problem is “a degradation of a circuit element under high use conditions at a rate higher than Intel’s quality goals after multiple years of service.”
Though we tend to think of solid-state electronics as just that – solid and unchanging – circuit degradation is a normal part of the lifecycle of a complex semiconductor like a processor. Quantum tunneling and other effects on a microscopic scale will wear down processors while they’re in use, leading to eventual performance degradation or operational failure. However even with modern processors the effect should take a decade or longer, much longer than the expected service lifetime of a chip. So when something happens to speed up the degradation process, if severe enough it can cut the lifetime of a chip to a fraction of what it was planned for, causing a chip (or line of chips) to fail while still in active use. And this is exactly what’s happening with the Atom C2000.
For Intel, this is the second time this decade that they’ve encountered a degradation issue like this. Back in 2011 the company had to undertake a much larger and more embarrassing repair & replacement program for motherboards using early Intel 6-series chipsets. On those boards an overbiased (overdriven) transistor controlling some of the SATA ports could fail early, disabling those SATA ports. And while Intel hasn’t clarified whether something similar to this is happening on the Atom C2000, I wouldn’t be too surprised if it was. Which isn’t to unnecessarily pick on Intel here; given the geometries at play (bear in mind just how small a 22nm transistor is) transistor reliability is a significant challenge for all players. Just a bit too much voltage on a single transistor out of billions can be enough to ultimately break a chip.
The Solution: New Silicon & Reworked Motherboards
Anyhow, the good news is that Intel has developed both a silicon workaround and a platform workaround. The long-term solution is of course rolling out a new revision of the C2000 silicon that incorporates a fix for the issue, and Intel has told The Register they’ll be doing just that. This will actually come somewhat late in the lifetime of the processor, as the current B0 revision was launched three and a half years ago and will be succeeded by Denverton this year. At the same time though, as an IT-focused product Intel will still need to offer the Atom C2000 series to customers for a number of years to come, so even with the cost of a new revision of the silicon, it’s in Intel’s long-term interest.
More immediately, the platform fix can be used to prevent the issue on boards with the B0 silicon. Unfortunately Intel isn’t disclosing just what the platform fix is, but if it is a transistor bias issue, then the fix is likely to involve reducing the voltage to the transistor, essentially bringing its degradation back to expected levels. Some individual product vendors are also reporting that the fix can be reworked into existing (post-production) boards, though it sounds like this can only prevent the issue, not fix an already-unbootable board.
Affected Products: Routers, Servers, & NASes
As a result of the nature of a problem, the situation is a mixed bag for device manufacturers and owners. First and foremost, while most manufacturers have used the LPC bus to host the boot ROM, not all of them have. For the smaller number of manufacturers who are using SPI Flash, this wouldn’t impact them unless they were using the LPC bus for something else. Otherwise for those manufacturers who are impacted, transistor degradation is heavily dependent on ambient temperature and use: the hotter a chip and the harder its run, the faster a transistor will degrade. Consequently, while all C2000 chips have the flaw, not all C2000 chips will have their LPC clock fail before a device reaches the end of its useful lifetime. And certainly not all C2000 chips will fail at the same time.
Cisco, whose routers are impacted, estimates that while issues can occur as early as 18 months in, they don’t expect a meaningful spike in failures until 3 years (36 months) in. This of course happens to be just a bit shorter than the age of the first C2000 products, which is likely why this issue hasn’t come to light until now. Failures would then become increasingly likely as time goes on, and accordingly Cisco will be replacing the oldest affected routers first, as they’re the most vulnerable to the degradation issue.
As for other vendors shipping Atom C2000-based products, those vendors are setting up their own support programs. Patrick Kennedy over at ServeTheHome has already started compiling a list of vendor responses, including Supermicro and Netgate. However as it stands a lot of vendors are still developing their response to the issue, so this will be an ongoing process.
Finally, what’s likely to be the most affected on the consumer side of matters will be on the Network Attached Storage front. As pointed out to me by our own Ganesh TS, Seagate, Synology, ASRock, Advantronix, and other NAS vendors have all shipped devices using the flawed chips, and as a result all of these products are vulnerable to early failures. These vendors are still working on their respective support programs, but for covered devices the result is going to be the same: the affected NASes will need to be swapped for models with fixed boards/silicon. So NAS owners will want to pay close attention here, as while these devices aren’t necessarily at risk of immediate failure, they are at risk of failure in the long term.
Sources: Tom’s Hardware, The Register, & ServeTheHome
…