SATA Drives exception PHYRdyChg timeout
The hardware is a Fujitsu professional server, and the hosting provider was rather helpless even after we had exchanged the drives, the cables and, twice, the whole server. The kernel is a current Red Hat Enterprise Linux 5.5 kernel. The SMART data of the disks showed no errors.
The kernel logged these messages during the freeze:
kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
kernel: ata1.00: irq_stat 0x00400000, PHY RDY changed
kernel: ata1: SError: { PHYRdyChg }
kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
kernel: res 40/00:04:7e:d9:78/00:00:05:00:00/40 Emask 0x10 (ATA bus error)
kernel: ata1.00: status: { DRDY }
kernel: ata1: hard resetting link
kernel: ata1: link is slow to respond, please be patient (ready=0)
kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: ata1.00: qc timeout (cmd 0xec)
kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x5)
kernel: ata1.00: revalidation failed (errno=-5)
kernel: ata1: failed to recover some devices, retrying in 5 secs
kernel: ata1: hard resetting link
kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: ata1.00: configured for UDMA/133
kernel: sd 0:0:0:0: timing out command, waited 30s
kernel: ata1: EH complete
kernel: SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
kernel: sda: Write Protect is off
kernel: SCSI device sda: drive cache: write back
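As a reading aid for the log above, here is a minimal Python sketch that turns an SErr value into the flag names libata prints inside the SError braces. The bit table is written from memory of libata's error handler and is an assumption, so check it against your kernel source; decode_serror is a hypothetical helper name, not a kernel or library function.

```python
# Bit positions and names as printed by libata's error handler.
# Assumption: recalled from libata-eh.c, verify against your kernel source.
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of all known flags set in an SError value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # SErr value taken from the first line of the log above.
    print(decode_serror(0x10000))
```

For the value in this log only bit 16 is set, which is consistent with the kernel's own "SError: { PHYRdyChg }" line: the link reported a change of its PHY ready state and no other error flag.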
I feared RAID inconsistencies and filesystem or database corruption as a result of these complete freezes. After hours of analysing and googling this non-reproducible and dangerous behaviour, I found this Red Hat bug report. The report claims that WD drives have serious problems with NCQ (Native Command Queuing), a performance feature of modern disks. Jeff Burke, who tracked down the issue, said: "So, apparently Western Digital don't have a clue how to do NCQ... I think these drives should be permanently blacklisted in kernel's libata.c, with NCQ on they are an absolute menace."
Since the disks in use were also from Western Digital, I patched their model ID into the kernel's NCQ blacklist:
--- a/drivers/ata/libata-core.c 2010-05-20 20:39:08.000000000 +0200
+++ b/drivers/ata/libata-core.c 2010-05-20 20:43:54.000000000 +0200
@@ -3924,6 +3924,7 @@
 { "Maxtor 7V300F0", "VA111630", ATA_HORKAGE_NONCQ },
 { "ST380817AS", "3.42", ATA_HORKAGE_NONCQ },
 { "ST3160023AS", "3.42", ATA_HORKAGE_NONCQ },
+{ "WDC WD2502ABYS-5*", NULL, ATA_HORKAGE_NONCQ },
 
 /* Blacklist entries taken from Silicon Image 3124/3132
    Windows driver .inf file - also several Linux problem reports */
With this patch, which disables NCQ for these drives, the problem never appeared again. To verify that the patch has disabled NCQ for the disks, check dmesg, which should now report NCQ (not used):
ata1.00: ATA-8: WDC WD2502ABYS-50B7A0, 02.03B04, max UDMA/133
ata1.00: 488397168 sectors, multi 16: LBA48 NCQ (not used)
ata1.00: configured for UDMA/133
Fortunately, I saw no performance degradation with NCQ disabled.
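If several machines or disks need checking, the dmesg check above can be scripted. The following Python sketch scans kernel log text for libata's per-device summary line and reports what stands in the parentheses after NCQ. The function name ncq_status and the regular expression are my own illustration, and the sample text is the dmesg excerpt from this post; on a drive with NCQ active the parentheses would presumably hold a queue depth instead, but that is an assumption about the message format.

```python
import re

# Matches libata's device summary line, e.g.
# "ata1.00: 488397168 sectors, multi 16: LBA48 NCQ (not used)"
NCQ_LINE = re.compile(r"(ata\d+\.\d+): \d+ sectors, .*?\bNCQ \((.+)\)\s*$")


def ncq_status(dmesg_text):
    """Return {device: text in the NCQ parentheses} for each matching line."""
    status = {}
    for line in dmesg_text.splitlines():
        match = NCQ_LINE.search(line)
        if match:
            status[match.group(1)] = match.group(2)
    return status


# Sample input: the dmesg excerpt quoted in this post.
SAMPLE = """\
ata1.00: ATA-8: WDC WD2502ABYS-50B7A0, 02.03B04, max UDMA/133
ata1.00: 488397168 sectors, multi 16: LBA48 NCQ (not used)
ata1.00: configured for UDMA/133
"""

if __name__ == "__main__":
    for device, state in sorted(ncq_status(SAMPLE).items()):
        print(device, "NCQ", state)
```

On a live system the text could come from the output of dmesg or from /var/log/messages instead of the embedded sample; since the pattern is searched anywhere in the line, a leading "kernel: " prefix does not matter.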