TrueNAS Pool Degraded: The Case of the "Dead" Drive That Wasn't
I connected to my TrueNAS Scale server to find a critical alert: my main photo backup pool was degraded, and a drive was completely missing from the configuration.


I recently connected to my TrueNAS server to find a critical alert! My main photo backup pool was degraded, and a drive was completely missing from the configuration.
The message was terrifying, but it contained a glimmer of hope: "Sufficient replicas exist." My RAIDZ1 setup was doing its job, keeping my data accessible with one drive down. But I was now flying without a safety net. It was time to troubleshoot, and I feared the worst for my month-old 8TB WD Red Plus drive.
Phase 1: The "Zombie" Drive Boot Loop
My first instinct was to reboot the server, hoping it was just a transient glitch. That was a mistake. The server wouldn't shut down cleanly. The web interface died, but the machine stayed powered on.
Connecting a monitor revealed the issue. The operating system was hanging, waiting indefinitely for the missing drive to synchronize its cache before powering off. It was stuck in a loop, throwing I/O errors and timing out, unable to get a response from the problem disk.
With the keyboard unresponsive, I had to perform a hard reset by holding the power button for 10 seconds. I decided to try booting it back up with the drive still connected to see if I could get SMART data, fully prepared to force-kill it again if it hung. Fortunately, the "zombie" drive responded just enough to let the system boot.
Phase 2: Investigation & Mistaken Identity
Once back in the web GUI, the pool was still degraded, but the system was live. I needed to verify which physical drive was the culprit. Since Linux drive letters (like /dev/sdb) can change between reboots based on the order drives respond, I couldn't rely on the old error logs. I needed to map the current drive letters to their serial numbers.
I jumped into the shell and used lsblk to find my missing drive.
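A minimal version of that check looks like this (lsblk ships with util-linux, so it's available on any TrueNAS Scale shell; the output columns are selected with `-o`):

```shell
# Map each block device to its serial number, so a physical disk can be
# identified regardless of which /dev/sdX letter it received on this boot.
lsblk -o NAME,SIZE,TYPE,MODEL,SERIAL
```

Matching the serial column against the serial TrueNAS reported in the alert gives you the drive's current letter.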
Aha! The drive with serial number WD-RD39**** (the one TrueNAS complained about) was currently assigned the letter sdc. The GUI also showed sdc with multiple errors. I had my target.
Phase 3: The Smoking Gun (SMART Data)
It was time to interrogate the drive with smartctl. Given the symptoms, I was expecting to see reallocated sectors or other signs of physical death, so I ran the command against the target drive letter.
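The check is along these lines (smartctl is part of smartmontools; `/dev/sdc` is the letter the lsblk mapping turned up on that boot):

```shell
# Full SMART report for the suspect drive: health test result,
# attribute table, and error log in one dump.
smartctl -a /dev/sdc

# Or narrow it to just the attributes that matter for this diagnosis.
smartctl -A /dev/sdc | grep -E 'Reallocated_Sector|Current_Pending|UDMA_CRC'
```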
The output was surprising. The overall health test passed, and the critical physical health indicators were perfect:
ID 5 (Reallocated_Sector_Ct): 0
ID 197 (Current_Pending_Sector): 0
The drive platters were physically fine. But scrolling down to the vendor-specific attributes revealed the real culprit:
ID 199 (UDMA_CRC_Error_Count): 185. This error means data is being corrupted in transit between the drive's logic board and the motherboard's SATA controller. The drive itself was perfectly healthy; the link it was using to talk to the server was not. TrueNAS had kicked the drive out because its data could no longer be trusted, and a climbing CRC error count is almost always caused by a faulty or poorly seated SATA cable.
Phase 4: The Fix & Verification
The fix was simple, and thankfully cheaper than a new 8TB drive. I shut down the server, opened the case, and replaced the SATA data cable connected to the drive with serial WD-RD39**** with a known-good spare. After booting back up, I needed to verify the fix.
Note: As expected, the drive letter had shifted again after the reboot. A quick check of lsblk showed my target drive was now /dev/sda.
I ran the SMART check again on the drive's new location to monitor the error count.
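To watch just that one counter rather than rereading the full report, the raw value (the last column of smartctl's attribute table) can be pulled out with awk:

```shell
# Drive letters shifted after the reboot; the target is /dev/sda now.
# Print only the raw UDMA CRC error counter for quick before/after comparison.
smartctl -A /dev/sda | awk '/UDMA_CRC_Error_Count/ {print $NF}'
```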
The result confirmed the fix. The UDMA_CRC_Error_Count was still at a raw value of 185. Since this is a lifetime counter that never resets, the fact that it didn't increase to 186 meant that zero new communication errors had occurred during the boot process with the new cable. The connection was stable.
Bringing the Pool Back to Life
Even though the hardware was fixed, TrueNAS still showed the pool as DEGRADED because it remembered the old error state. I had to manually tell ZFS to clear the error history for that pool.
Finally, to ensure complete data integrity and prove the drive wouldn't drop offline under load, I ran a full pool scrub from the GUI. This forces ZFS to read every block of data, verify its checksum, and repair any inconsistencies from parity data.
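For anyone who prefers the shell to the GUI, both steps map onto zpool one-liners (the pool name "tank" below is a placeholder; substitute your own):

```shell
# Forget the recorded error history and bring the device fully back online.
zpool clear tank

# Read and checksum-verify every block in the pool, repairing from parity.
zpool scrub tank

# Watch scrub progress and the per-device read/write/checksum error counters.
zpool status -v tank
```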
The scrub finished in about an hour with 0 checksum errors. The pool returned to a healthy ONLINE state, and my data was safe.
Conclusion
What looked like a disastrous hard drive failure turned out to be a faulty $5 SATA cable. This experience was a great reminder to never assume a drive is dead without digging into the SMART data first. A high UDMA_CRC_Error_Count is a dead giveaway for connection issues, not drive failure.
The scariest part of this ordeal was that I only discovered it by chance when I logged into the dashboard. A silent failure is the worst kind. In my next post, I'll document how I set up email and push notifications in TrueNAS Scale so the server can scream for help the moment something goes wrong, long before a drive gets kicked out of the pool.