SSD reliability in the real world: Google’s experience

“SSDs are a new phenomenon in the datacenter. We have theories about how they should perform, but until now, little data,” Robin Harris writes for ZDNet. “That’s just changed.”

“The FAST 2016 paper Flash Reliability in Production: The Expected and the Unexpected, (the paper is not available online until Friday) by Professor Bianca Schroeder of the University of Toronto, and Raghav Lagisetty and Arif Merchant of Google, covers: Millions of drive days over 6 years, 10 different drive models, 3 different flash types: MLC, eMLC and SLC, [and] Enterprise and consumer drives,” Harris writes.

“Two standout conclusions from the study. First, that MLC [multi-level cell] drives are as reliable as the more costly SLC [single-level cell] ‘enterprise’ drives. This mirrors hard drive experience, where consumer SATA drives have been found to be as reliable as expensive SAS and Fibre Channel drives,” Harris writes. “The paper’s second major conclusion, that age, not use, correlates with increasing error rates, means that over-provisioning for fear of flash wearout is not needed.”

Read more in the full article here.

MacDailyNews Take: Harris reminds us, and we’d like to highlight: “Backing up SSDs is even more important than it is with disks.” See also from ComputerWeekly: MLC vs SLC: Which flash SSD is right for you?
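
For a rough sense of scale behind that backup advice, here is a minimal sketch (in Python) that converts an uncorrectable bit error rate (UBER) spec, the metric Harris cites, into the volume of data you can expect to read before hitting one unrecoverable error. The UBER values used are illustrative assumptions, not figures from the study.

```python
# Rough sense of scale, not figures from the study: expected data read
# per uncorrectable bit error for a given UBER (errors per bit read).

def tb_read_per_uncorrectable_error(uber: float) -> float:
    """Expected terabytes read per uncorrectable bit error."""
    bits_per_error = 1.0 / uber
    return bits_per_error / 8 / 1e12  # bits -> bytes -> terabytes

# Illustrative UBER values only; check a drive's datasheet for real specs.
for uber in (1e-14, 1e-15, 1e-17):
    print(f"UBER {uber:.0e}: ~{tb_read_per_uncorrectable_error(uber):,.0f} TB per expected error")
```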

12 Comments

  1. “Uncorrectable Bit Error Rate (UBER) specs. SSD UBER rates are higher than disk rates, which means that backing up SSDs is even more important than it is with disks. The SSD is less likely to fail during its normal life, but more likely to lose data.”
    How does Time Machine handle files it can no longer read? Will it stop backing up? Just skip the damaged file? Back it up anyway, even though it is unreadable?

    1. The point is to use a combination of RAID (like RAID 60) and Forward Error Correction (FEC) codes (like a combination of LDPC or TPC with BCH) to push the effective BER down to 10^-14 or 10^-15 for truly critical data sets. Some systems even add an interleaving step (mixing the bits out of order), but this can work against the RAID implementation of something like RAID 60.

      One thing most people forget is the cascade effect of encryption. If you encrypt a file with typical strong encryption (like a common mode and hash of AES-256), then every bit error in the encrypted data can cascade into as many as 100 bit errors in the decrypted data (a short sketch after the comment thread illustrates this). If, perchance, those 100 bits form a contiguous run (unlikely, but possible), this could overwhelm any FEC recovery. This is one of the things that pushes target storage BER rates down, often to the 10^-14 or 10^-15 range for critical data sets.

      Just remember the old data storage golden rule:
      If you have just one copy of your data, eventually you WILL have zero copies of that data!

    2. Time Machine will just inexplicably corrupt your entire backup. Meaning you should also back up to an external disk via Carbon Copy Cloner and use an offsite backup service such as CrashPlan, Carbonite, Backblaze, or Mozy as well.

  2. “Backing up SSDs is even more important than it is with disks.”

    Unless you kick the cord on your external disk drive and it goes flying to the ground, or you drop your laptop with a spinning hard drive.
    In both those cases, you are MUCH better off with an SSD. Speaking from painful firsthand experience.

    1. Apple has had various implementations of ZFS in the wings for several years. For a while (and maybe still) you could bolt a specific ZFS implementation onto Mac OS X. There was a very strong rumor that Apple was going to introduce ZFS for the Mac two or three years back.

      Apple apparently believes none of the implementations it has investigated is ready for prime time yet.

        1. It’s a file system, not a logic system. Ubuntu and others have made ZFS a reality, and with current fast processing and storage media, it is also now fast enough to be ready for mass deployment.

          But as I frequently have to point out to the MDN crowd, Apple leadership is too slow. Cook is a copier, not an innovator. He has already lost huge chunks of professional computing by refusing to offer the fastest desktop hardware or any server hardware of any kind, and now Ubuntu cements its lead in these areas with an obvious edge in data robustness. How bad do things have to get before average Mac users get alarmed?


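A commenter above notes that one bit error in encrypted data can cascade into many bit errors after decryption. Here is a minimal sketch of that effect, assuming AES-256 in CBC mode via the third-party Python cryptography package (our choice for illustration; the comment does not name a mode or library): flipping a single ciphertext bit scrambles the entire corresponding 16-byte plaintext block and flips exactly one bit in the block that follows.

```python
# Minimal sketch (assumes the third-party 'cryptography' package is installed):
# flip one ciphertext bit and count how many plaintext bits change after
# AES-256-CBC decryption.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key, iv = os.urandom(32), os.urandom(16)   # AES-256 key, CBC IV
plaintext = os.urandom(64)                 # 4 full AES blocks, so no padding needed

enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
ciphertext = enc.update(plaintext) + enc.finalize()

corrupted = bytearray(ciphertext)
corrupted[20] ^= 0x01                      # simulate a one-bit storage error in block 1

dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
recovered = dec.update(bytes(corrupted)) + dec.finalize()

flipped = sum(bin(a ^ b).count("1") for a, b in zip(plaintext, recovered))
print(f"1 ciphertext bit flipped -> {flipped} plaintext bits differ")
# Typically ~65: the damaged block decrypts to noise (about half of its 128
# bits differ) plus exactly 1 flipped bit in the next block (CBC propagation).
```

That lines up with the comment’s rough figure of up to 100 corrupted bits per single-bit error for a 128-bit block cipher in a chaining mode.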