zfs - CKSUM errors (and a bad SATA cable)

A few days ago, I ran a zpool scrub on the ZFS RAID-Z array of my home NAS. This machine is a real piece of beauty: powered by a low-energy Celeron and running FreeBSD with ZFS, it is our main storage system for photos, pictures, data, disk images, etc. It basically powers the whole home infrastructure. During a routine zpool scrub, I noticed CKSUM errors popping up.

zfs CKSUM errors

Now, this is not good, as the CKSUM column indicates the number of uncorrectable checksum errors on a device. So, rapid action is required!
As the affected disk is rather old, my first guess was that it was going bad. I wanted to run smartctl to check its SMART status, so I first checked glabel status to find the underlying physical device.

Use glabel status to map the output of zpool status to the underlying hardware
Output of smartctl -a /dev/ada4
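
The lookup described above can be sketched roughly like this. It is only a sketch: the helper name label_to_provider and the label gpt/disk4 are made up for illustration, and it assumes the usual three-column output of glabel status (Name, Status, Components).

```sh
# Hypothetical helper: print the provider behind a GEOM label,
# using the "Name  Status  Components" table from `glabel status`.
label_to_provider() {
    glabel status | awk -v label="$1" '$1 == label { print $3 }'
}

# Usage on the NAS itself (gpt/disk4 is a made-up label name):
#   label_to_provider gpt/disk4
#   smartctl -a /dev/ada4    # full SMART report for the disk found above
```

If the label points to a partition rather than a whole disk, smartctl needs the disk device itself (here /dev/ada4).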

The SMART status looked good, but I still decided to replace the disk, as the CKSUM errors made me nervous. I put in the new disk and replaced the old one with the standard ZFS tools, via zpool replace.

The zpool replace command immediately blew up in my face. In dmesg I found some weird messages along the lines of "No such pool or dataset".
Now, that's weird. So apparently the issue was not the disk itself, but what else? Probably just a faulty connection: either a failing SATA controller or a bad SATA cable. The cable had been working just fine for years and I hadn't changed anything, so I was puzzled. Normally I suspect moving or mechanical parts to fail first, so when CKSUM errors appear, a plain cable ranks way below a failing HDD on my list of suspects.
Yeah, I was wrong. Just replacing the SATA cable (because I had one at hand and it was orders of magnitude easier than replacing a SATA controller) healed the whole system. zpool scrub has now run through happily for the second time (the first time it still had to repair some errors, probably caused by faulty writes earlier), and the NAS is running as smoothly as it always did.
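
For reference, a minimal sketch of how a pool can be re-checked after a hardware fix like this. The function name rescrub_pool and the pool name tank are placeholders, and zpool clear (which resets the error counters) is an extra step that is not part of the story above.

```sh
# Hypothetical helper: reset the error counters and scrub the pool again.
rescrub_pool() {
    zpool clear "$1"       # reset the READ/WRITE/CKSUM counters
    zpool scrub "$1"       # start a new scrub (runs in the background)
    zpool status -v "$1"   # show scrub progress and the current counters
}

# Usage (tank is a placeholder pool name):
#   rescrub_pool tank
```

Running zpool status again once the scrub has finished shows whether any new CKSUM errors have appeared.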

No errors, everything is happy again 🙂

So, in a nutshell

CKSUM errors in ZFS without any READ or WRITE errors can also be caused by a faulty connection (such as a bad SATA cable) and do not necessarily indicate a failing hard disk.
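
This rule of thumb can be turned into a small filter over the zpool status output. Again, this is just a sketch: the function name cksum_only_devices is made up, and it assumes the usual NAME / STATE / READ / WRITE / CKSUM table layout. It prints every entry (disk, vdev or pool) with a non-zero CKSUM counter but zero READ and WRITE counters.

```sh
# Hypothetical helper: read `zpool status` output on stdin and print
# the entries that show CKSUM errors but no READ or WRITE errors.
cksum_only_devices() {
    awk '
        # The header line marks the start of the device table.
        $1 == "NAME" && $5 == "CKSUM" { in_config = 1; next }
        # A blank line ends the device table.
        in_config && NF == 0 { in_config = 0 }
        # Counters are compared as strings, so values like "1.2K" also count.
        in_config && NF >= 5 && $3 == "0" && $4 == "0" && $5 != "0" { print $1 }
    '
}

# Usage:
#   zpool status | cksum_only_devices
```

Anything this prints is a candidate for checking the cable and controller before buying a new disk.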

Lucky me, now I have a spare disk in stock in case one day a disk really goes bad 🙂