Axel's root Blog

for nerds only - little stories from the everyday sysadmin life with problems and their hard-to-find solutions

SATA Drives exception PHYRdyChg timeout

2010-05-24 by Axel Reinhold, tagged as hardware, linux
Between two and five times a month I had SATA kernel exceptions with 40-second timeouts that froze the server completely.

The hardware is a professional Fujitsu server, and the hosting provider was rather helpless after we had exchanged the drives, the cables and, twice, the whole server. The kernel is a current Red Hat Enterprise Linux 5.5 kernel. The SMART data of the disks showed no errors.

The kernel logged these messages during the freeze:

kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
kernel: ata1.00: irq_stat 0x00400000, PHY RDY changed
kernel: ata1: SError: { PHYRdyChg }
kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
kernel:          res 40/00:04:7e:d9:78/00:00:05:00:00/40 Emask 0x10 (ATA bus error)
kernel: ata1.00: status: { DRDY }
kernel: ata1: hard resetting link
kernel: ata1: link is slow to respond, please be patient (ready=0)
kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: ata1.00: qc timeout (cmd 0xec)
kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x5)
kernel: ata1.00: revalidation failed (errno=-5)
kernel: ata1: failed to recover some devices, retrying in 5 secs
kernel: ata1: hard resetting link
kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: ata1.00: configured for UDMA/133
kernel: sd 0:0:0:0: timing out command, waited 30s
kernel: ata1: EH complete
kernel: SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
kernel: sda: Write Protect is off
kernel: SCSI device sda: drive cache: write back
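
The SErr value in the first log line is a bit mask, and the "SError: { PHYRdyChg }" line is the kernel's own decoding of it. As a rough sketch of that decoding, the following maps bit positions to the short names libata prints. The table is written from my reading of the SERR_* constants in the kernel's include/linux/ata.h and the names in libata-eh.c, so treat it as an assumption and check it against your kernel source before relying on it. The function name decode_serror is made up for this example.

```python
# Bit positions are assumed to follow the SERR_* constants in the Linux
# kernel's include/linux/ata.h; the names mirror the "SError: { ... }" output.
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of all known flags set in an SError value."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


# SErr value taken from the log above (0x10000 is bit 16).
print(decode_serror(0x10000))  # prints ['PHYRdyChg']
```

A PHYRdyChg flag on its own only says that the link's PHY ready state changed; it does not say whether the drive, the cable or the controller caused it.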

I feared RAID inconsistencies and filesystem or database corruption during these complete freezes. After hours of analysing and googling this non-reproducible and dangerous behaviour I found this Red Hat bug report. The report claims that WD drives have serious problems with NCQ (Native Command Queuing), which is a performance feature of modern disks. Jeff Burke, who tracked down this issue, said: "So, apparently Western Digital don't have a clue how to do NCQ... I think these drives should be permanently blacklisted in kernel's libata.c, with NCQ on they are an absolute menace."

Since the disks in use were also from Western Digital, I patched their model IDs into the NCQ blacklist of the kernel:

--- a/drivers/ata/libata-core.c 2010-05-20 20:39:08.000000000 +0200
+++ b/drivers/ata/libata-core.c 2010-05-20 20:43:54.000000000 +0200
@@ -3924,6 +3924,7 @@
        { "Maxtor 7V300F0",     "VA111630",     ATA_HORKAGE_NONCQ },
        { "ST380817AS",         "3.42",         ATA_HORKAGE_NONCQ },
        { "ST3160023AS",        "3.42",         ATA_HORKAGE_NONCQ },
+       { "WDC WD2502ABYS-5*",  NULL,           ATA_HORKAGE_NONCQ },

        /* Blacklist entries taken from Silicon Image 3124/3132
           Windows driver .inf file - also several Linux problem reports */
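
To illustrate how an entry such as "WDC WD2502ABYS-5*" with a NULL revision is meant to match a drive, here is a simplified model of the blacklist lookup. It assumes that only a trailing "*" acts as a wildcard and that a NULL (None) revision matches any firmware revision. That is how I understand the older libata matching, while newer kernels use fuller glob matching. The helper names and the second model string in the usage lines are hypothetical.

```python
def pattern_matches(pattern, value):
    """Match value against pattern; only a trailing '*' is a wildcard (assumption)."""
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return pattern == value


def is_blacklisted(model, revision, entry):
    """entry is (model_pattern, revision_pattern_or_None), like one table row."""
    entry_model, entry_rev = entry
    if not pattern_matches(entry_model, model):
        return False
    # A None revision stands in for NULL in the C table: any firmware matches.
    return entry_rev is None or pattern_matches(entry_rev, revision)


wd_entry = ("WDC WD2502ABYS-5*", None)

# Model and firmware strings as the kernel reports them for the drives in this post.
print(is_blacklisted("WDC WD2502ABYS-50B7A0", "02.03B04", wd_entry))

# Hypothetical model string outside the "-5*" family, for contrast.
print(is_blacklisted("WDC WD2502ABYS-18B7A0", "02.03B04", wd_entry))
```

Keeping the pattern as narrow as "-5*" limits the blacklist to the drive family that actually showed the problem.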
With this patch, which disables NCQ for these drives, the problem never appeared again. To verify that the patch has disabled NCQ for the disks, check dmesg, which should report NCQ (not used):
ata1.00: ATA-8: WDC WD2502ABYS-50B7A0, 02.03B04, max UDMA/133
ata1.00: 488397168 sectors, multi 16: LBA48 NCQ (not used)
ata1.00: configured for UDMA/133
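
On a machine with several disks the check above can be scripted. The sketch below scans dmesg-style text for the libata device summary line and collects the NCQ state per device. The regular expression is an assumption based on the line format shown above, and the "depth 31/32" line for ata2.00 is an invented example of what an NCQ-enabled drive typically reports. The function names are made up for this example.

```python
import re

# Assumed format of libata's device summary line, e.g.
# "ata1.00: 488397168 sectors, multi 16: LBA48 NCQ (not used)"
NCQ_LINE = re.compile(r"(ata\d+\.\d+): \d+ sectors, .*\bNCQ \((.+?)\)")


def ncq_status(dmesg_text):
    """Map each ATA device found in the text to its reported NCQ state."""
    status = {}
    for line in dmesg_text.splitlines():
        match = NCQ_LINE.search(line)
        if match:
            status[match.group(1)] = match.group(2)
    return status


def ncq_disabled(dmesg_text):
    """Return the devices whose summary line reports NCQ as not used."""
    return [dev for dev, state in ncq_status(dmesg_text).items() if state == "not used"]


# First two lines are from this post; the ata2.00 line is a made-up contrast case.
SAMPLE = """\
ata1.00: ATA-8: WDC WD2502ABYS-50B7A0, 02.03B04, max UDMA/133
ata1.00: 488397168 sectors, multi 16: LBA48 NCQ (not used)
ata2.00: 488397168 sectors, multi 16: LBA48 NCQ (depth 31/32)
"""

print(ncq_disabled(SAMPLE))
```

To run it against a live system, pass in the output of the dmesg command or the contents of the boot log instead of SAMPLE.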
Fortunately I saw no performance degradation from disabling NCQ.