Home NAS with ZFS – failed disk!

It’s been over two years since I built my home NAS box.  A couple of days ago I logged in and noticed a file operation was taking a while, so I ran “zpool status” and was shocked to see that one of the disks had failed at some point.  I really have no idea when it happened, but the “last scan” date said something about December, so it could have been months.  I was a bad sysadmin and didn’t have any alerting for this case; a redundant array failing over silently is a risk most people don’t think about, though coincidentally it’s one of the projects I’m currently working on at work.
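
In hindsight, even a dumb cron job watching “zpool status -x” (which prints “all pools are healthy” when there’s nothing wrong) would have caught this months earlier.  A minimal sketch of the idea, with the schedule and email address as placeholders:

#!/bin/sh
# Rough zpool health check, meant to be run from cron (e.g. hourly).
# "zpool status -x" prints "all pools are healthy" when there is nothing
# to report, so any other output is worth an email.
STATUS=$(zpool status -x)
if [ "$STATUS" != "all pools are healthy" ]; then
    echo "$STATUS" | mail -s "ZFS pool problem on $(hostname)" admin@example.com
fi

ZFS on Linux also ships the ZFS event daemon (zed), which can be configured to send email on pool events, so that’s probably the more proper fix.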

Anyway, here’s what the pool looked like with the failed disk:

root@lunix:~# zpool status
  pool: lunix1
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 12h54m with 0 errors on Sun Dec  2 14:18:15 2018
config:

        NAME                      STATE     READ WRITE CKSUM
        lunix1                    DEGRADED     0     0     0
          mirror-0                ONLINE       0     0     0
            sdb                   ONLINE       0     0     0
            sdc                   ONLINE       0     0     0
          mirror-1                DEGRADED     0     0     0
            sdd                   ONLINE       0     0     0
            11769402787959493007  UNAVAIL      0     0     0  was /dev/sde1

errors: No known data errors


Fortunately, back when I built this thing, I bought a spare disk, which had been sitting on the shelf in its box from Newegg for two years.  A little while ago I swapped it in for the dead drive, ran “zpool import” to bring the pool back in, and then did the “zpool replace”:

root@lunix:~# zpool replace -f lunix1 11769402787959493007 sdd
root@lunix:~# zpool status
  pool: lunix1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 26 15:47:41 2019
    21.9M scanned out of 3.71T at 3.12M/s, 345h50m to go
    10.5M resilvered, 0.00% done
config:

        NAME                        STATE     READ WRITE CKSUM
        lunix1                      DEGRADED     0     0     0
          mirror-0                  ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
          mirror-1                  DEGRADED     0     0     0
            sde                     ONLINE       0     0     0
            replacing-1             UNAVAIL      0     0     0
              11769402787959493007  FAULTED      0     0     0  was /dev/sde1
              sdd                   ONLINE       0     0     0  (resilvering)

errors: No known data errors
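
One thing worth noting from that output: the sdX names shuffled around after the swap.  The dead disk “was /dev/sde1”, but now sde is the healthy half of mirror-1 and the new drive showed up as sdd.  If I were doing this again I’d probably point “zpool replace” at the stable /dev/disk/by-id path instead of the bare device letter, so the pool isn’t tied to whatever names the kernel hands out after the next reboot.  Roughly like this (the id string here is made up; “ls -l /dev/disk/by-id/” shows the real ones):

# Find the persistent by-id name that points at the new disk (sdd here),
# then use that path in the replace instead of the bare device letter.
ls -l /dev/disk/by-id/ | grep -w sdd
zpool replace -f lunix1 11769402787959493007 /dev/disk/by-id/ata-EXAMPLE_MODEL_EXAMPLESERIAL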

345 hours to go?!  I ran zpool status about 20 minutes later and got a much better number:

root@lunix:~# zpool status
  pool: lunix1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 26 15:47:41 2019
    125G scanned out of 3.71T at 101M/s, 10h18m to go
    62.3G resilvered, 3.28% done
config:

        NAME                        STATE     READ WRITE CKSUM
        lunix1                      DEGRADED     0     0     0
          mirror-0                  ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
          mirror-1                  DEGRADED     0     0     0
            sde                     ONLINE       0     0     0
            replacing-1             UNAVAIL      0     0     0
              11769402787959493007  FAULTED      0     0     0  was /dev/sde1
              sdd                   ONLINE       0     0     0  (resilvering)

errors: No known data errors
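
The estimate bounces around a lot early on while the scan rate ramps up (3 MB/s at first versus 101 MB/s twenty minutes in), so the initial number isn’t worth panicking over.  To keep an eye on it without retyping the command, something like this works:

# Re-run zpool status every 60 seconds and show just the scan progress lines.
watch -n 60 'zpool status lunix1 | grep -A2 scan:'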

Anyway, this was my first experience replacing a failed disk in a ZFS pool, so it was pretty exciting.  The hardest part of the process was figuring out which physical disk was the failed one.  The bad one didn’t even show up as a device in /dev, so what I ended up doing was running “hdparm -i /dev/sd?” against the disks that were still there, noting the serial numbers, and then checking the sticker on each physical drive to find the one whose serial wasn’t visible to the OS.  Overall it was a pretty good process.
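
If you want the serials in one tidy list to compare against the stickers, a quick loop does it.  A rough sketch, assuming the data disks are sda through sde and hdparm is installed (smartctl -i from smartmontools works too):

# Print each present disk next to its serial number, for matching
# against the labels on the physical drives.
for d in /dev/sd[a-e]; do
    printf '%s: ' "$d"
    hdparm -I "$d" | grep 'Serial Number'
done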
