It’s been over two years since I built my home NAS box. A couple of days ago I logged in and noticed some file operation was taking a while, so I did a “zpool status” and was shocked to see that one of the disks had failed at some point. I really have no idea when it happened, but the “last scan” date said something about December, so it could have been months. I was a bad sysadmin and didn’t have alerting for this case – silent HA failover is a risk most people don’t think about, though coincidentally it’s one of the projects I’m currently working on at work.
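For what it’s worth, even a trivial check dropped into cron would have caught this. Here’s a sketch, assuming “zpool status -x” prints “all pools are healthy” when nothing is wrong and that the box has a working local mailer (the “root” recipient is just a placeholder):

#!/bin/sh
# Daily ZFS health check for cron: mail the degraded status if any pool is unhealthy.
# Assumes a working local mailer; point the recipient wherever you actually read mail.
STATUS="$(zpool status -x)"
if [ "$STATUS" != "all pools are healthy" ]; then
    echo "$STATUS" | mail -s "ZFS pool problem on $(hostname)" root
fi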
Anyway, here’s what the failed zpool looked like:
root@lunix:~# zpool status
  pool: lunix1
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 12h54m with 0 errors on Sun Dec 2 14:18:15 2018
config:

        NAME                      STATE     READ WRITE CKSUM
        lunix1                    DEGRADED     0     0     0
          mirror-0                ONLINE       0     0     0
            sdb                   ONLINE       0     0     0
            sdc                   ONLINE       0     0     0
          mirror-1                DEGRADED     0     0     0
            sdd                   ONLINE       0     0     0
            11769402787959493007  UNAVAIL      0     0     0  was /dev/sde1

errors: No known data errors
Fortunately, back when I built this thing, I bought a spare disk, which has been sitting on the shelf in its box from Newegg for two years. I swapped it in for the dead drive a little while ago, ran “zpool import” to bring the pool back in, and then did the “zpool replace”:
root@lunix:~# zpool replace -f lunix1 11769402787959493007 sdd
root@lunix:~# zpool status
  pool: lunix1
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 26 15:47:41 2019
        21.9M scanned out of 3.71T at 3.12M/s, 345h50m to go
        10.5M resilvered, 0.00% done
config:

        NAME                          STATE     READ WRITE CKSUM
        lunix1                        DEGRADED     0     0     0
          mirror-0                    ONLINE       0     0     0
            sdb                       ONLINE       0     0     0
            sdc                       ONLINE       0     0     0
          mirror-1                    DEGRADED     0     0     0
            sde                       ONLINE       0     0     0
            replacing-1               UNAVAIL      0     0     0
              11769402787959493007    FAULTED      0     0     0  was /dev/sde1
              sdd                     ONLINE       0     0     0  (resilvering)

errors: No known data errors
345 hours to go?! I ran zpool status about 20 minutes later and got a much better number:
root@lunix:~# zpool status
  pool: lunix1
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 26 15:47:41 2019
        125G scanned out of 3.71T at 101M/s, 10h18m to go
        62.3G resilvered, 3.28% done
config:

        NAME                          STATE     READ WRITE CKSUM
        lunix1                        DEGRADED     0     0     0
          mirror-0                    ONLINE       0     0     0
            sdb                       ONLINE       0     0     0
            sdc                       ONLINE       0     0     0
          mirror-1                    DEGRADED     0     0     0
            sde                       ONLINE       0     0     0
            replacing-1               UNAVAIL      0     0     0
              11769402787959493007    FAULTED      0     0     0  was /dev/sde1
              sdd                     ONLINE       0     0     0  (resilvering)

errors: No known data errors
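The early estimate is just noise from the scan ramping up. If you’d rather not keep re-running the command by hand, something like this refreshes the status every minute and highlights the numbers as they change (assuming “watch” is installed):

watch -d -n 60 zpool status lunix1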
Anyway, this was my first experience replacing a failed disk in a ZFS pool, so it was pretty exciting. The hardest part of the process was figuring out which disk was the failed one. The bad one didn’t even show up as a device in /dev, so what I ended up doing was running “hdparm -i” against each disk that was there, noting the serial numbers, and then looking at the label on the outside of each physical disk to find the one with a serial that wasn’t available to the OS. Overall it was a pretty good process.
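In case it’s useful to anyone, the serial-number hunt can be scripted. This is just a sketch, assuming every surviving drive answers “hdparm -i” and reports a SerialNo= field:

for d in /dev/sd?; do
    # Print the device name and its serial so it can be matched against the drive labels.
    printf '%s: ' "$d"
    hdparm -i "$d" | grep -o 'SerialNo=[^ ]*'
done

Whatever serial is printed on a drive’s label but missing from that list is the dead one.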