It’s been over two years since I built my home NAS box. A couple of days ago I logged in and noticed some file operation was taking a while, so I did a “zpool status” and was shocked to see that one of the disks had failed at some point. I really have no idea when it happened, but the “last scan” date said something about December, so it could have been months. I was a bad sysadmin and didn’t have alerting for this case – silent HA failover is a risk most people don’t think of, because the redundancy keeps everything working and nothing looks wrong until the next disk dies. Coincidentally, it’s one of the projects I’m currently working on at work.
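In hindsight, even a dumb cron job would probably have caught this months earlier. A minimal sketch of the kind of check I should have had (it assumes a working “mail” command, and the address is obviously a placeholder):

#!/bin/sh
# Minimal ZFS health check, meant to run from cron (e.g. hourly).
# "zpool status -x" prints exactly "all pools are healthy" when nothing
# is wrong, and the full status output otherwise.
STATUS=$(zpool status -x)
if [ "$STATUS" != "all pools are healthy" ]; then
    echo "$STATUS" | mail -s "ZFS pool problem on $(hostname)" you@example.com
fi

Stick that in a crontab and it stays quiet until something actually needs attention.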
Anyway, here’s what the degraded zpool looked like:
root@lunix:~# zpool status
  pool: lunix1
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 12h54m with 0 errors on Sun Dec 2 14:18:15 2018
config:

        NAME                      STATE     READ WRITE CKSUM
        lunix1                    DEGRADED     0     0     0
          mirror-0                ONLINE       0     0     0
            sdb                   ONLINE       0     0     0
            sdc                   ONLINE       0     0     0
          mirror-1                DEGRADED     0     0     0
            sdd                   ONLINE       0     0     0
            11769402787959493007  UNAVAIL      0     0     0  was /dev/sde1

errors: No known data errors
Fortunately, back when I built this thing, I bought a spare disk, which had been sitting on the shelf in its box from Newegg for two years. A little while ago I swapped it in for the dead drive, ran “zpool import” to bring the pool back in, and then did the “zpool replace”.
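The import step isn’t in the output below, but it’s nothing exciting – roughly this (a sketch of the standard commands rather than a capture of my actual terminal; the bare “zpool import” just lists whatever pools it can find):

# list pools that are available to import
zpool import
# import the pool by name
zpool import lunix1

Then the replace, and the status right after: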
root@lunix:~# zpool replace -f lunix1 11769402787959493007 sdd
root@lunix:~# zpool status
  pool: lunix1
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 26 15:47:41 2019
        21.9M scanned out of 3.71T at 3.12M/s, 345h50m to go
        10.5M resilvered, 0.00% done
config:

        NAME                        STATE     READ WRITE CKSUM
        lunix1                      DEGRADED     0     0     0
          mirror-0                  ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
          mirror-1                  DEGRADED     0     0     0
            sde                     ONLINE       0     0     0
            replacing-1             UNAVAIL      0     0     0
              11769402787959493007  FAULTED      0     0     0  was /dev/sde1
              sdd                   ONLINE       0     0     0  (resilvering)

errors: No known data errors
345 hours to go?! I ran zpool status about 20 minutes later and got a much better number:
root@lunix:~# zpool status
  pool: lunix1
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 26 15:47:41 2019
        125G scanned out of 3.71T at 101M/s, 10h18m to go
        62.3G resilvered, 3.28% done
config:

        NAME                        STATE     READ WRITE CKSUM
        lunix1                      DEGRADED     0     0     0
          mirror-0                  ONLINE       0     0     0
            sdb                     ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
          mirror-1                  DEGRADED     0     0     0
            sde                     ONLINE       0     0     0
            replacing-1             UNAVAIL      0     0     0
              11769402787959493007  FAULTED      0     0     0  was /dev/sde1
              sdd                   ONLINE       0     0     0  (resilvering)

errors: No known data errors
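Once the resilver finishes, the old GUID entry should drop out of the tree and the pool should go back to ONLINE. The follow-up is just the obvious checks (standard commands, not a transcript – and the scrub is optional, but given my track record it seems wise):

# confirm the pool is healthy again once the resilver completes
zpool status lunix1
# kick off a fresh scrub to verify everything end to end
zpool scrub lunix1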
Anyway, this was my first experience replacing a failed disk in a ZFS pool, so it was pretty exciting. The hardest part of the process was figuring out which physical disk was the failed one. The bad one didn’t even show up as a device in /dev, so what I ended up doing was running “hdparm -i” against each disk that was there, noting the serial numbers, and then reading the label on the outside of each drive to find the one whose serial the OS couldn’t see. Overall it was a pretty good process.
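If I ever have to do this again, something like the following would save a bit of squinting (a rough sketch – it assumes hdparm’s usual “Model=..., FwRev=..., SerialNo=...” output line, and the /dev/disk/by-id listing is probably the lazier way to get the same answer):

# print the serial number the OS reports for each disk it can see
for d in /dev/sd?; do
    printf '%s: ' "$d"
    hdparm -i "$d" | grep -o 'SerialNo=[^ ]*'
done

# the symlink names under /dev/disk/by-id encode the serials too
ls -l /dev/disk/by-id/

Whichever serial on a drive label doesn’t show up in that output is the disk to pull.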