Always monitor your RAID

TLDR: If you don’t like unpleasant surprises, then ALWAYS monitor your RAID systems.

# A mystery story

Last week when I started my day, I noticed something strange: My blog (the one you’re reading) suddenly only showed posts until the end of July. So I checked my watch, and after realizing that we hadn’t magically travelled back in time, I knew something was off.

My blog is a git repository, so I logged into the server. The repository was at the state of the end of July 🤨 Weird … Ok, then I checked my Forgejo instance and saw that it was also only at the state of the end of July 🤨🤨🤨

Something is off. Something is possibly maybe horribly mildly off …

I tried to figure out what the reason could be. I know that on Wednesday nights a couple of maintenance jobs run, e.g. creating VM snapshots, applying some updates, … In short: $stuff happens. On top of this, the updates of the VM in question, of the hypervisor and of the corresponding podman containers are automated, so it’s not that easy to identify a single culprit just from here. I needed to dig a bit deeper.

The first thing I wanted to check was whether this issue was related to that single VM. So I checked my Matrix server. The messages were ok, I saw new messages from over night and also from the last days, so everything seemed to be fine there. However, I noticed that some media files were not accessible anymore 🤨

Next I checked Nextcloud. I immediately realized that some of the notes I keep there were out of sync. They were outdated to a state that could well be the end of July. At this point I was puzzled how Matrix could be up-to-date while at least two other VMs showed signs of being outdated.

It was clear to me that something WAS off. So I did what I considered the sanest thing to do: shut all VMs down, notify the users of ongoing server issues and investigate. This is a bummer, but better safe than sorry. For the record: this is also the first and only issue of this scale since I started running my services, years back.
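For reference, shutting everything down on a libvirt host can be as simple as looping over the running domains; this is just a sketch and it doesn’t wait for the guests to actually power off:

# for vm in $(virsh list --name); do virsh shutdown "$vm"; done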

First I created VM snapshots of every VM to have a safe restore point. Then I started my deep-dive with the VM where my blog is hosted. There I have an easy way of checking the state: the git repository of the blog itself. My first check was to see if the last snapshot contained more data and whether this might be an issue with the current snapshot or the automation. At this point I had no clue where to start, so I thought it best to walk back in time and see what went wrong. After creating a current snapshot, I wanted to restore the VM snapshot of last week:

# virsh snapshot-revert --domain feldspaten --snapshotname auto-2023-09-20

The process ran for some time and then it failed:

error: Failed to revert snapshot auto-2023-09-20
error: internal error: Child process (/usr/bin/qemu-img snapshot -a auto-2023-09-20 /mnt/RAID/virt/feldspaten.qcow2) unexpected exit status 1: qemu-img: Could not apply snapshot 'auto-2023-09-20': Failed to load snapshot: No such file or directory

I know this error: it means that the snapshot is not present in the qcow2 file. So apparently there was something very wrong with the disk images.
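You can cross-check this by listing the snapshots that are actually stored inside the image and comparing them with what libvirt thinks should be there (using the image path from the error message above):

# qemu-img snapshot -l /mnt/RAID/virt/feldspaten.qcow2
# virsh snapshot-list --domain feldspaten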

I checked another VM and got the same error message there. What puzzled me was that the snapshot revert took some time and failed only after 30 seconds or so. That is weird, it should fail immediately or at least within a few seconds, not tens of seconds.

But this also gave me a clue. My intuition was pointing me in a certain direction, and I didn’t like this direction at all. But before we go there, let me explain how this server is set up and how I manage my stuff here.

This server is a dedicated root server. It has one 500 GB SSD and 2x3 TB HDDs in RAID1. Every VM has a system disk on the SSD and a data disk on the RAID1 or the SSD, depending on the workload. The database of the Matrix server, for instance, is a VM image on the SSD, while the media files are in a VM image on the RAID. Like this I can put each workload on the best suited storage medium, while having the advantages of both HDDs (storage capacity) and SSDs (low latency, performance). So with this knowledge in mind, can you guess what could have gone wrong? 🙂
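To make this concrete, the per-VM layout looks roughly like the following sketch. The SSD path and file names are illustrative; only /mnt/RAID/virt is something you can actually see in the qemu-img error above:

/var/lib/libvirt/images/vm-system.qcow2    <- system disk, on the SSD (illustrative path)
/mnt/RAID/virt/vm-data.qcow2               <- data disk, on the RAID1 (cf. the error above)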

I went and checked the RAID status.
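The array state can be read straight from /proc/mdstat:

# cat /proc/mdstat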

Personalities : [raid1]
md127 : active raid1 sdb[0]
      2930266432 blocks super 1.0 [2/1] [U_]
      bitmap: 18/22 pages [72KB], 65536KB chunk

unused devices: <none>

See the U_? This means that one of the disks is up (U), while the other one is missing or degraded (_).
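If you want more than the one-line summary from /proc/mdstat, e.g. which device failed and when the array was last updated, mdadm can print the details (output omitted here):

# mdadm --detail /dev/md127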

Can you guess now what went wrong? For some reason, the RAID kicked out one of the disks. I checked the mail on the server and noticed that there had been an event back in July! I forgot to set up mail notifications for mdadm, so it went unnoticed (decent monitoring is still WIP). This is bad and I take all the blame for this issue. So when the server rebooted after creating VM snapshots and stuff, for some reason mdadm decided to use the hard disk that had gone out of sync months ago. But because I only keep part of the data on the RAID and other data on the SSD, only the data on the RAID was outdated. Remember the Matrix instance that showed me the messages but not the media? The reason is that the database for the messages is on the SSD, while the media files are on the RAID.

Now it’s clear what happened: the RAID booted into an out-of-sync disk state from July, and so all VM disk images on the RAID also travelled back in time. The disk images on the SSD didn’t travel back in time.

# Recovery

To be extra sure I pulled all current VMs to an external backup storage, and then restored the degraded RAID array by triggering a rebuild of the outdated disk:

# mdadm /dev/md/RAID --remove /dev/sdc
# mdadm --zero-superblock /dev/sdc
# mdadm /dev/md/RAID --add /dev/sdc
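The rebuild progress can be watched in /proc/mdstat, which shows a recovery percentage and an estimated finish time while the array is syncing:

# watch -n 60 cat /proc/mdstat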

After one night the RAID was rebuilt, and now it’s happy again:

Personalities : [raid1]
md127 : active raid1 sdc[2] sdb[0]
      2930266432 blocks super 1.0 [2/2] [UU]
      bitmap: 4/22 pages [16KB], 65536KB chunk

unused devices: <none>

A reboot later, I had all the data from Wednesday night back. All VMs are running and the state is mostly consistent again.

I still need to do some manual work so that I don’t lose data from between the reboot and the RAID restore (e.g. mails that were received during this period), but apart from that everything is now back to where it’s supposed to be.

# Lesson learned

Never ever run a RAID without monitoring.

This is a lesson that I thought I had already learned, but I forgot to set up even mail notifications on a rather important system. Mail alone is not sufficient monitoring, but it is at least something you can turn on rather easily. A decent monitoring system is still WIP.
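For the record, the minimal mdadm mail notification I should have had in place boils down to something like this (a sketch; the mail address is a placeholder and the config path is /etc/mdadm/mdadm.conf on Debian-based systems):

# echo 'MAILADDR admin@example.com' >> /etc/mdadm.conf
# mdadm --monitor --scan --test --oneshot

The first line tells mdadm where to send alerts, the second one fires a test mail for every array so you can verify delivery before you actually need it. The continuous monitoring itself is typically handled by a long-running mdadm --monitor instance or a distribution-provided service such as mdmonitor.service.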

Simply put: ALWAYS monitor your RAID arrays and zpools if you don’t like unpleasant surprises or “surprise backup restore tests”. They can be valuable, but they also contribute to more gray hair than necessary. Cheers 😃