
A faulted NAS drive and the loooooong wait for replacement

Last month one of my NAS hard disks died. I’m running a RAID-Z1 array (“RAID5”) with 3 Seagate IronWolf drives (real NAS drives, not the cheap consumer hardware). SMART reported errors on one of those drives, and when I investigated I saw that the zpool was DEGRADED and ZFS had spat out the drive in question. It’s marked as FAULTED.

Degraded zpool with a FAULTED drive
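A check like this can be automated. The following is only a minimal sketch, assuming the `zpool` binary is on the PATH and that `zpool status -x` prints the single line `all pools are healthy` when nothing is wrong. The function names are made up for illustration.

```python
import subprocess

# Assumption: this is what `zpool status -x` prints when no pool has problems.
HEALTHY_MESSAGE = "all pools are healthy"


def pools_are_healthy(status_output: str) -> bool:
    """Return True if the given `zpool status -x` output reports no problems."""
    return status_output.strip() == HEALTHY_MESSAGE


def check_pools() -> bool:
    """Run `zpool status -x` and report whether every pool is healthy."""
    result = subprocess.run(
        ["zpool", "status", "-x"],
        capture_output=True,
        text=True,
        check=True,
    )
    return pools_are_healthy(result.stdout)
```

Calling `check_pools()` from a cron job and sending a mail when it returns `False` would be one way to notice a DEGRADED pool early.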

SMART is also complaining:

The following warning/error was logged by the smartd daemon:

Device: /dev/sde [SAT], Self-Test Log error count increased from 0 to 1

Device info:
ST12000VN0008-<redacted>, S/N:<redacted>, WWN:<redacted>, FW:<redacted>, 12.0 TB

Device: /dev/sde [SAT], 8 Currently unreadable (pending) sectors

...

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     15584         -
# 2  Short offline       Completed: read failure       30%     15560         -
# 3  Short offline       Completed without error       00%     15536         -
# 4  Extended offline    Completed without error       00%     15530         -
# 5  Short offline       Completed without error       00%     15512         -
# 6  Short offline       Completed without error       00%     15488         -
# 7  Short offline       Completed without error       00%     15464         -
# 8  Short offline       Completed without error       00%     15440         -
# 9  Short offline       Completed without error       00%     15416         -
#10  Short offline       Completed without error       00%     15392         -
#11  Short offline       Completed without error       00%     15368         -
#12  Extended offline    Interrupted (host reset)      20%     15359         -
#13  Short offline       Completed without error       00%     15344         -
#14  Short offline       Completed without error       00%     15320         -
#15  Short offline       Completed without error       00%     15296         -
#16  Short offline       Completed without error       00%     15272         -
#17  Short offline       Completed without error       00%     15248         -
#18  Short offline       Completed without error       00%     15224         -
#19  Short offline       Completed without error       00%     15200         -
#20  Extended offline    Interrupted (host reset)      60%     15183         -
#21  Short offline       Completed without error       00%     15176         -
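Spotting the one bad entry in a log like this is easy to script. Here is a rough sketch that parses the self-test rows with a regular expression and picks out the failed ones. It assumes the column layout shown above (test name and status separated by at least two spaces), and the function names are hypothetical.

```python
import re

# One row of the self-test log, assuming the column layout shown in the post.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>.+?)\s{2,}"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)$"
)

# A few rows copied from the log above.
SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     15584         -
# 2  Short offline       Completed: read failure       30%     15560         -
# 4  Extended offline    Completed without error       00%     15530         -
#12  Extended offline    Interrupted (host reset)      20%     15359         -
"""


def parse_selftest_log(text):
    """Return one dict per self-test row; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append({
                "num": int(match["num"]),
                "test": match["test"],
                "status": match["status"],
                "remaining": int(match["remaining"]),
                "hours": int(match["hours"]),
                "lba": match["lba"],
            })
    return rows


def failed_tests(rows):
    """Keep only the rows whose status mentions a failure."""
    return [row for row in rows if "failure" in row["status"].lower()]


for row in failed_tests(parse_selftest_log(SAMPLE_LOG)):
    print(f"Test #{row['num']} ({row['test']}) failed at {row['hours']} power-on hours")
```

Interrupted tests are deliberately not counted as failures here, since a host reset says nothing about the health of the disk.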

Uh oh …

To rule out some weird issue with the PCI bus, SATA cables or whatnot, I did a full system reboot and noticed weird noises when the disk tried to spin up. Yep, that’s it, that drive is gone. Probably a head crash, judging by the scratching noises.

And this is what a broken hard disk sounds like. Listen to the scratching noises at 30 seconds and at ~1 minute.

Broken HDD noises

I still had warranty on the drive, so I went to the hardware store, where I got a full refund. They would have replaced the disk immediately, but they were out of stock. However, I was promised the drives would be back in stock at the beginning of February, when I could buy a new disk with a new warranty and everything. I was a happy customer being happy about the happy customer service 😀.

Except for …

The beginning of February came, and I was checking the website daily to see when the drive would be available. To reduce the risk of another disk failure while the array was running without its redundant disk, I shut down the NAS. This sucks because all of my stuff is on there, including backups of other machines and such … So when this system is down, it causes me a major inconvenience. Day after day I’m checking the store for a new disk. “Temporarily sold out” became my constant companion. I’m locked in because I can spend the refund only at that particular store, and those disks are not cheap.

As more time passes, I get increasingly nervous and annoyed. Midweek, on the 20th of February, we finally decided to call the supplier. To my big surprise they told us … they have new drives! It might just take them 1-2 days to feed them into the system (?) so they would be in stock again. And … it did happen. On Friday I could order two brand new disks. On Sunday they arrived. The joy of being able to access my own stuff again, ahhh. At last, relief! The resilvering of the zpool worked nicely and now I have a fully functional zpool again 😀

zfs resilvering in progress

Side note: Yeah, I could have turned the NAS on at any time and would have had access to my stuff. I have an off-site backup and the important stuff is mirrored to another device as well, but there is also some stuff on there that isn’t backed up and would be inconvenient to lose. Not a drama, but an ugggghhh moment that I don’t want to have. So technically it can be seen as a self-inflicted precautionary measure, but hey, that’s who I am. Better safe than sorry.

Lesson learned

Important lesson learned: When you rely on a NAS system to remain online and you don’t want to wait a month for a replacement disk, make sure you have a spare disk on hand.