Naked Security

HPE warns of impending SSD disk doom

The company has revealed that many of its SSDs are set to permanently fail by default after 32,768 hours of operation.

Techies are used to worrying about the longevity of their data storage. Hard drive heads used to have a nasty habit of crashing before laptops introduced software to protect them from drops and power surges. ‘Data rot’ can damage your DVD storage, and magnetic tape can suffer as its substrates and binders degrade.

But what about the firmware, which contains the instructions for reading and writing from the media in the first place? That’s now an issue too, thanks to HPE. It had to recall some of its solid-state drives (SSDs) last week after it found that they were inadvertently programmed to fail.

The company released a critical firmware patch for its serial-attached SCSI (SAS) SSDs, after revealing that they would permanently fail by default after 32,768 hours of operation. That’s right: assuming they’re left on all the time, three years, 270 days, and eight hours after you write your first bit to one of these drives, your records and the disk itself will become unrecoverable.
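If you want to check that sum, it’s plain arithmetic. Here’s a quick back-of-the-envelope sketch in C (the 32,768-hour figure comes from HPE’s advisory; leap days are ignored for simplicity):

    #include <stdio.h>

    int main(void) {
        const unsigned long limit_hours = 32768UL;    /* power-on hours at which the drive dies */
        unsigned long whole_days  = limit_hours / 24; /* 1365 days */
        unsigned long spare_hours = limit_hours % 24; /* 8 hours */
        unsigned long years       = whole_days / 365; /* 3 years, ignoring leap days */
        unsigned long days        = whole_days % 365; /* 270 days */

        printf("%lu hours = %lu years, %lu days, %lu hours\n",
               limit_hours, years, days, spare_hours);
        printf("...or %lu seconds\n", limit_hours * 3600UL);  /* 117,964,800 */
        return 0;
    }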

The company explained the problem in an advisory, adding that an unnamed SSD vendor tipped it off about the issue. These drives crop up in a range of HPE products. If you’re an HPE ProLiant, Synergy, Apollo, JBOD D3xxx, D6xxx, D8xxx, MSA, StoreVirtual 4335, or StoreVirtual 3200 user and you’re using a version of the HP firmware before HPD8, you’re affected.

You might hope that a RAID configuration would save you. RAID implementations (other than RAID 0, which focuses on speed) store data redundantly, meaning that you can recover your data if a disk in your system goes down. However, as HPE points out in its advisory:

SSDs which were put into service at the same time will likely fail nearly simultaneously.

Unless you replaced some SSDs in your RAID box, they’ve probably all been operating for the same amount of time. RAID doesn’t help you if all your disks die at once.

This bug affects 20 SSD model numbers, and to date, HPE has only patched eight of them. The remaining 12 won’t get patched until the week beginning 9 December 2019. So if you bought those disks a few years ago and haven’t got around to backing them up yet, you might want to get on that.

HPE explains that you can also use its Smart Storage Administrator to calculate your total drive power-on hours and find out how close to data doomsday your drive is. Here’s a PDF telling you how to do that.
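(If you’re on a generic Linux box with smartmontools rather than HPE’s tooling, the same power-on-hours figure is visible in the drive’s SMART data. The sketch below is a rough illustration of that idea only, not HPE’s procedure: /dev/sda is a placeholder, and the attribute names differ between ATA and SAS drives.)

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Placeholder device path: point this at the disk you care about. */
        FILE *p = popen("smartctl -a /dev/sda", "r");
        if (p == NULL) { perror("popen"); return 1; }

        char line[512];
        while (fgets(line, sizeof line, p) != NULL) {
            /* ATA drives report attribute 9 as Power_On_Hours;
               SAS drives report "Accumulated power on time" instead. */
            if (strstr(line, "Power_On_Hours") != NULL ||
                strstr(line, "Accumulated power on time") != NULL) {
                fputs(line, stdout);
            }
        }
        pclose(p);
        return 0;
    }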

Unfortunately, HPE didn’t include the same kind of warning that Mission: Impossible protagonist Jim Phelps got at the beginning of every episode: “This tape will self-destruct in five seconds”.

But then, 117,964,800 seconds is a little harder to scan. In any case, your mission, should you choose to accept it, is to back those records up.

14 Comments

Funny, sounds a lot like the Swedish hospital (SÄS) debacle. Workstations simply died like flies in a very short timespan about a couple of months ago. I can’t see how this is unrelated, to be honest. Please shed some light.

32768 = 2**15

Indeed, and 32767+1 equals -32768 if you use a signed 16-bit integer and don’t check for overflow/underflow!

That’s because any unsigned 16-bit number of 2**15 and above has its 16th bit (the most significant bit) set, but in a signed 16-bit number, setting the high bit causes the number to be treated as negative.

That is why signed and unsigned integers can be mixed only with [a] great care or [b] lurking danger. Or perhaps both.
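To see the wraparound these comments describe in action, here’s a minimal C sketch (the actual counter type inside the firmware is an assumption on our part – HPE’s advisory doesn’t spell it out, though the 32,768 figure is a strong hint):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* Hypothetical power-on-hours counter held in a signed 16-bit integer. */
        int16_t hours = 32767;     /* INT16_MAX: the largest value it can hold */
        printf("hour %d: still fine\n", hours);

        hours = hours + 1;         /* 32768 doesn't fit; in practice this wraps
                                      around to -32768 */
        printf("hour %d: the counter has gone negative\n", hours);

        if (hours < 0) {
            printf("any firmware that trusts this counter is now in trouble\n");
        }
        return 0;
    }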

“unnamed SSD vendor” Important details are missing… Glad I don’t use HPE but what other vendors could be affected??

Thanks for this question. It’s a good one, and one that took me a while to get an answer to. I learned that the unnamed vendor was Samsung. However, the firmware defect was restricted to the drives that Samsung had made for HPE.

It’s not a ‘bug’ whatsoever – the evidence shows it’s by design: “Crippling Technology for Profit”. It’s been going on for years; you just have to be old enough to know.

To be fair, all the evidence I’ve seen suggests this was down to incompetent coding and not to malicious design – a 16-bit signed integer has overflowed, wrapped around, and triggered a show-stopping bug.

If you wanted to code this sort of thing deliberately to pitch a new sale to your customers, the one thing you wouldn’t do would be to ruin the hardware at an unpredictable moment that would take both you and the customer by surprise… you’d have a “failure warning” period during which there was time for your sales people to work the pitch and for your customers to get the purchase order ready. (When was the last time your printer ran out of toner abruptly, between one page and the next? I bet you it warned you for ages, and even popped up a web link that made it a matter of moments to order a replacement cartridge in plenty of time.)

Hi Paul,
Interesting. In my reply I was hoping to highlight the non-random references in the document, to give balance to the idea that it may have been by design:
“they were inadvertently programmed to fail.”
and:
“permanently fail by default after 32,768 hours of operation”
The “unnamed SSD vendor” has an interest in continuing to manufacture and re-supply through life-cycle replacement.
Who knows if this was their guarantee of future turnover.
OEM manufacturers are not silly: the product “life-cycle” of mass production is the core of their methodology, guaranteeing throughput of product into the future via upgrades and replacements.
Whatever the cause, isn’t it convenient that manufacturing wins?
Maybe this one just got out of the bag.

The words “inadvertently programmed to fail” can really only be interpreted as saying that this was *not* by design, otherwise the author would almost certainly have written “deliberately” instead (on the grounds that no one uses the word “advertently” any more and expects to be taken seriously).

What you seem to be talking about is “planned obsolescence”. The bug we are talking about here doesn’t give you much chance of planning any future business around a product’s end-of-life – you’d be far more likely to earn a reputation for unpredictability, so you’d be much more likely to drive any future turnover into your competitors’ order books than your own.

Hold the conspiracy theories – this one is a SNAFU.

Wilfried Bergmann – you are right on.

A quick google reveals:
A 16-bit integer can store 2**16 (or 65,536) distinct values. In an unsigned representation, these values are the integers between 0 and 65,535; using two’s complement, possible values range from −32,768 to 32,767.

This looks like a simple programming oversight at the lowest level, which has now reached the highest level, and the media. Looks like HPE needs to step up their quality control. And yes, I agree robert, how many other firmwares (from other vendors) are affected? It’s amazing how systems fail due to the tiniest error. You would imagine that “Power On” statistics (i.e. power-on hours) would only affect the “Predictive Failure” flag of the SMART statistics.

I think this article needs to be renamed “Millennium Bug for SSD drives”.

It reminds me of a good Oracle bug – the Oracle*Net client stopped working when server uptime reached 2**15 (in milliseconds, as I remember), froze for the next 2**15 milliseconds, then started working again for the next 2**15, and so on (I’m not sure about the exact numbers, but it came to something less than a year). No QA team can test for this, because QA systems never run that long without a reboot.

What is bad is this: why do the disks STOP WORKING? I can understand if the disks stop WRITING, but what kind of design is it when an SSD can’t even be read?

It reminds me of a different problem. SSDs have limited write endurance, so in time a disk can stop accepting new data. OK, but the data still exists. So how many SAN storage designs allow such disks to keep working in READ mode, so that admins can insert a spare disk or copy data off the RAID, rather than seeing the whole system fail outright?

2**15 is 32,768. In milliseconds, that’s just over half a minute. There are just under 32 billion milliseconds in a year, so the first power-of-2 that is bigger than that is 2**35. (Failure at 2**31 or 2**32 seems much more likely because signed long ints run out after 31 bits and unsigned long ints run out after 32 bits.)
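For anyone who wants to check those figures, here’s a quick sanity check in C – plain arithmetic, nothing drive-specific:

    #include <stdio.h>

    int main(void) {
        /* 2**15 milliseconds is only about 33 seconds */
        printf("2^15 ms = %.3f seconds\n", 32768.0 / 1000.0);

        /* milliseconds in an average (365.25-day) year: just under 32 billion */
        double ms_per_year = 365.25 * 24 * 3600 * 1000;
        printf("ms per year = %.0f\n", ms_per_year);

        /* the first power of two that exceeds a year's worth of milliseconds */
        int n = 0;
        while ((double)(1ULL << n) < ms_per_year) {
            n++;
        }
        printf("first power of two above that: 2^%d\n", n);  /* prints 35 */

        /* where 31-bit and 32-bit millisecond counters run out instead */
        printf("2^31 ms = %.1f days\n", 2147483648.0 / (86400.0 * 1000.0));  /* ~24.9 days */
        printf("2^32 ms = %.1f days\n", 4294967296.0 / (86400.0 * 1000.0));  /* ~49.7 days */
        return 0;
    }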

As for SSDs falling back into read-only mode when the time-counter exceeds a certain value so they can be imaged and replaced – there’s nothing to stop the disk device driver or the disk mounting command doing that anyway. Remember that this counter isn’t supposed to stop the drive working at all; it’s just there to let you monitor usage and decide when or if to replace the drive.
