Proxmox - Replacing failed drive in ZFS pool

Dell Server in a Homelab running ZFS

I have a Dell R720 server with enterprise-grade SSDs for my homelab. This powerhouse runs all the home services, ad-blockers, and this blog!

For the first time in recent memory, one of my drives failed.

The first challenge was to identify which drive had failed! Unfortunately, I couldn't find an easier way, so I had to pull the drives one by one to see which one it was. Once I had identified it, I ordered a replacement SSD.
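If the drives still answer SMART queries, one possible alternative to pulling them is to compare serial numbers against the labels printed on the drives. The sketch below is only an illustration: serial_from_smartctl is a hypothetical helper name, and it assumes smartctl -i prints a "Serial Number:" line, which may not hold for every controller or for a drive that has died completely. In that case, listing the serials of the healthy drives and looking for the one that is missing may still narrow it down.

```shell
# Hypothetical helper: extract the serial number from `smartctl -i` output
# read on stdin. Matches the label case-insensitively because ATA and SAS
# drives capitalize it differently.
serial_from_smartctl() {
    awk -F': *' 'tolower($1) == "serial number" { print $2; exit }'
}

# Usage on a live system (assumes smartmontools is installed):
#   smartctl -i /dev/sda | serial_from_smartctl
```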

This is what you will see when you check the status of your ZFS pool:

replicator# zpool status
  pool: backups
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 540K in 0 days 05:28:45 with 0 errors on Wed Mar  6 18:51:22 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        backups                                         DEGRADED     0     0     0
          raidz3-0                                      DEGRADED     0     0     0
            gptid/a38d9d54-2470-11e7-be70-ac220b8c944c  ONLINE       0     0     0
            gptid/18b749b3-c0b6-11e7-81f8-ac220b8c944c  ONLINE       0     0     0
            15678995806359064346                        OFFLINE      0     0     0  was /dev/gptid/8016f0cf-5557-11e4-a84e-ac220b8c944c
            gptid/a0fb3bb7-c685-11e4-acbc-ac220b8c944c  ONLINE       0     0     0
            gptid/80f31598-5557-11e4-a84e-ac220b8c944c  ONLINE       0     0     0

errors: No known data errors

I have a pool named backups with 5 drives in total. One of them has failed and is showing an OFFLINE state.
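With more drives, the unhealthy entry can be harder to spot by eye. Here is a minimal sketch that filters the status output down to the entries that are not ONLINE. The function name list_unhealthy_devices is made up for this example, and it assumes the NAME/STATE table layout shown above.

```shell
# Hypothetical helper: print the name and state of every entry in the
# device table that is not ONLINE. Reads `zpool status` output on stdin.
list_unhealthy_devices() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # table header
        /^errors:/ { in_config = 0 }                           # end of table
        in_config && NF >= 5 && $2 != "ONLINE" { print $1, $2 }
    '
}

# Usage on a live system:
#   zpool status backups | list_unhealthy_devices
```

The pool and the raidz vdev are listed too, since they inherit the DEGRADED state. The last line is usually the drive you are after.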

Simply remove the failed drive and put the new drive in its place. Make sure you don't partition the new drive. ZFS will do it for you automatically when the replacement starts!

The next step is to identify the new drive's device ID. The easiest way is to go to the Proxmox GUI and click on the name of your instance > Disks.

All the drives in my pool were the same make and model, but the new drive was a different model, so it was easy for me to see that it had been attached as /dev/sdm. You can also copy the previous device ID and run this command:

ls -la /dev/disk/by-id | grep -i 'previous-device-id-here'

This will tell you which device node that ID points to.
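The lookup also works in the other direction: starting from a device name such as sdm, you can find its persistent ID. Names like /dev/sdm can change between reboots, so some people prefer to give ZFS the /dev/disk/by-id path instead. A rough sketch, assuming the usual symlink layout of ls -l /dev/disk/by-id (by_id_for_device is a name I made up):

```shell
# Hypothetical helper: given a kernel device name (e.g. sdm) as $1 and the
# output of `ls -l /dev/disk/by-id` on stdin, print every ID that links to
# that whole disk (partition links such as sdm1 are skipped).
by_id_for_device() {
    awk -v dev="$1" '$NF == ("../../" dev) { print $(NF - 2) }'
}

# Usage on a live system:
#   ls -l /dev/disk/by-id | by_id_for_device sdm
```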

Replace the drive by running this command:

zpool replace faster 9181524188806271229 /dev/sdm

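The syntax of the above command is as follows:

zpool replace <pool> <old device> <new device>

Here, faster is the name of my pool and 9181524188806271229 is the numeric ID that zpool status shows for the failed device. Substitute the values from your own zpool status output.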

This will start the resilvering process and replace the dead drive. You can check the progress by running the zpool status command.

Notice the speed of the resilvering drive!

ZFS makes it super easy to replace a dead drive!

Once the process completes, your pool will no longer be in the DEGRADED state and will become ONLINE.
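If you don't want to keep re-running zpool status by hand, a small loop can wait for the resilver to finish. This is a sketch only: it assumes the scan line contains the text "resilver in progress" while the resilver is running, and resilver_in_progress is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the `zpool status` output on stdin
# reports a resilver that is still running.
resilver_in_progress() {
    grep -q 'resilver in progress'
}

# Usage on a live system (checks once a minute, pool name is an example):
#   while zpool status faster | resilver_in_progress; do sleep 60; done
#   zpool status faster
```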

I would also recommend the following steps after the resilvering is completed:

# Scrub your pool
zpool scrub [your-pool-name]

# Run a long SMART self-test on the new drive
smartctl -t long /dev/sdX

Here, X is the letter of the new replacement disk.
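The long test runs in the background on the drive itself, so the result only shows up later (smartctl -l selftest /dev/sdX lists the self-test log). For a quick pass/fail check, here is a minimal sketch that looks at the output of smartctl -H. It assumes the two health lines that smartctl commonly prints for ATA and SAS drives, and smart_health_ok is just an illustrative name.

```shell
# Hypothetical helper: succeeds if the `smartctl -H` output on stdin
# contains one of the usual "healthy" lines (ATA or SAS wording).
smart_health_ok() {
    grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

# Usage on a live system (replace X with the new disk's letter):
#   smartctl -H /dev/sdX | smart_health_ok && echo "new drive looks healthy"
```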

We're done now!