Proxmox - Replacing failed drive in ZFS pool
I have a Dell R720 server with enterprise-grade SSDs for my homelab. This powerhouse feeds all the home services, ad-blockers, and this blog!
This is the first time in recent memory that one of my drives has failed.
The first challenge was identifying which drive had failed. Unfortunately, I couldn't find an easier way: I had to pull the drives one by one to see which one it was. Once I had identified it, I ordered a replacement SSD.
This is what you will see when you check the status of your ZFS pool:
replicator# zpool status
  pool: backups
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 540K in 0 days 05:28:45 with 0 errors on Wed Mar 6 18:51:22 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        backups                                         DEGRADED     0     0     0
          raidz3-0                                      DEGRADED     0     0     0
            gptid/a38d9d54-2470-11e7-be70-ac220b8c944c  ONLINE       0     0     0
            gptid/18b749b3-c0b6-11e7-81f8-ac220b8c944c  ONLINE       0     0     0
            15678995806359064346                        OFFLINE      0     0     0  was /dev/gptid/8016f0cf-5557-11e4-a84e-ac220b8c944c
            gptid/a0fb3bb7-c685-11e4-acbc-ac220b8c944c  ONLINE       0     0     0
            gptid/80f31598-5557-11e4-a84e-ac220b8c944c  ONLINE       0     0     0

errors: No known data errors
I have a pool named backups with 5 drives in total; one of them has failed and is showing an OFFLINE state.
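On a pool with many members, the failed device is easy to miss in the full listing. Here is a minimal sketch that filters the status output down to the entries that are not ONLINE. It assumes the column layout shown above (NAME, STATE, READ, WRITE, CKSUM); the function name unhealthy_vdevs is made up for illustration.

```shell
# Print the name and state of every pool/vdev/device line whose STATE
# column is something other than ONLINE. Reads `zpool status` text on stdin,
# so it works on saved output as well as on a live system.
unhealthy_vdevs() {
    # $2 is the STATE column; requiring $3 to start with a digit (the READ
    # counter) skips the "state: DEGRADED" summary line near the top.
    awk '$2 ~ /^(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ && $3 ~ /^[0-9]/ { print $1, $2 }'
}

# Usage on a live system (assumes a pool named "backups"):
#   zpool status backups | unhealthy_vdevs
```

The pool and the raidz vdev show up too, because they inherit the DEGRADED state from the missing member; the device you need to replace is the innermost entry.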
Identify and physically remove the failed drive, then install the new drive in its place. Make sure you don't partition the new drive: when you hand ZFS a whole disk, zpool replace labels and partitions it for you before the resilver starts.
The next step is to identify the new drive's device id. The easiest way is to go to the Proxmox GUI and click on the name of your instance > Disks.
All the drives in my pool were the same make and model, but the replacement was a different model, so it was easy to spot that it was attached as /dev/sdm. You can also try copying the previous device id and running this command:
ls -la /dev/disk/by-id | grep -i 'previous-device-id-here'
This will tell you which /dev/sdX node the device was attached to.
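If you would rather see every id next to the device node it resolves to, the symlinks can be walked in a loop. This is only a sketch: it assumes GNU readlink (which Proxmox ships), and the helper name list_disk_ids and its optional directory argument are my own additions.

```shell
# Print each entry of a by-id directory together with the device node it
# points to. The directory defaults to /dev/disk/by-id; the argument exists
# mainly so the function can be tried against a scratch directory.
list_disk_ids() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "$(basename "$link")" "$(readlink -f "$link")"
    done
}

# Usage, hiding the per-partition entries:
#   list_disk_ids | grep -v -- '-part'
```

A brand-new disk shows up here with its model and serial number in the id, which makes it easy to match against the label on the drive itself.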
Replace the drive by running this command:
zpool replace faster 9181524188806271229 /dev/sdm
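The syntax of the above command is: zpool replace <pool> <old device> <new device>. Here faster is the pool name and the long number is the id that zpool status shows in place of the failed drive; substitute your own values.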
This will start the resilvering process and replace the dead drive. You can check the progress by running the zpool status command.
Notice the speed of the resilvering drive!
ZFS makes it super easy to replace a dead drive!
Once the process completes, your pool will no longer be in the DEGRADED state and will become ONLINE.
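Rather than re-running zpool status by hand, you can wait for the resilver in a loop. The sketch below relies on the "resilver in progress" text that zpool status prints on its scan line while a resilver is running; the helper name resilver_running, the pool name faster, and the 60 second interval are just placeholders.

```shell
# Exit 0 while the `zpool status` text on stdin reports a running resilver,
# non-zero otherwise (for example once the scan line reads "resilvered ...").
resilver_running() {
    grep -q 'resilver in progress'
}

# Poll once a minute, then print the final state:
#   while zpool status faster | resilver_running; do sleep 60; done
#   zpool status faster
```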
I would also recommend the following steps after the resilvering has completed:
# Scrub your pool
zpool scrub [your-pool-name]
# Run a long SMART self-test on the new drive (replace X with the letter of the replacement disk)
smartctl -t long /dev/sdX
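Both of these run in the background, so it is worth checking the results afterwards. A small sketch, assuming the "all pools are healthy" message that zpool status -x prints when nothing is wrong; the helper name pools_healthy is hypothetical.

```shell
# Exit 0 only if the `zpool status -x` text on stdin is exactly the
# "all pools are healthy" message.
pools_healthy() {
    grep -qx 'all pools are healthy'
}

# Usage once the scrub has finished:
#   zpool status -x | pools_healthy && echo "pools OK"
# Read the SMART self-test log once the long test has finished:
#   smartctl -l selftest /dev/sdX
```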
We're done now!