How To Tell The Difference Between A Failed Disk And A Failing Disk
Why is it important to distinguish between a failed disk and one that is still in the process of failing? Because a disk that has already failed takes fewer steps to replace, and knowing which case you are in tells you which steps you can skip.
In this example, two disks, c1t0d0 and c1t1d0, are mirrored to each other using Solaris Volume Manager (SVM). c1t1d0 is showing signs of impending failure or has failed already, as the case may be. Here are the differences.
A failing disk is still visible in the format command.
AVAILABLE DISK SELECTIONS:
0. c1t0d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
/pci@780/pci@0/pci@9/scsi@0/sd@0,0
1. c1t1d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
/pci@780/pci@0/pci@9/scsi@0/sd@1,0
2. c1t2d0 <SEAGATE-ST973402SSUN72G-0400-68.37GB>
/pci@780/pci@0/pci@9/scsi@0/sd@2,0
3. c1t3d0 <SEAGATE-ST973402SSUN72G-0400-68.37GB>
/pci@780/pci@0/pci@9/scsi@0/sd@3,0
Specify disk (enter its number):
A failed disk is marked “drive not available”.
AVAILABLE DISK SELECTIONS:
0. c1t0d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
/pci@1f,700000/scsi@2/sd@0,0
1. c1t1d0 <drive not available>
/pci@1f,700000/scsi@2/sd@1,0
2. c1t2d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
/pci@1f,700000/scsi@2/sd@2,0
3. c1t3d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
/pci@1f,700000/scsi@2/sd@3,0
Specify disk (enter its number):
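The check above can be scripted. The following is a minimal sketch, not a definitive tool: the function name unavailable_disks is made up for illustration, and it only reads a format-style listing on standard input, so it can be tried against saved output. It assumes the listing looks like the examples above, with the device name as the second field.

```shell
# Print the device name of every disk that a format-style listing
# (read from stdin) marks as "<drive not available>".
# unavailable_disks is a hypothetical helper name, not a Solaris command.
unavailable_disks() {
    awk '/<drive not available>/ { print $2 }'
}

# Possible use on a live system (assumption: format prints its disk
# listing and exits when its stdin is /dev/null):
#   format </dev/null 2>/dev/null | unavailable_disks
```

For the second listing above, this would print c1t1d0.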
A failing disk will show read or write errors in /var/adm/messages.
Jan 1 03:11:19 solaris_1 scsi: [ID 107833 kern.warning] WARNING: /pci@1c,600000/scsi@2/sd@1,0 (sd1):
Jan 1 03:11:19 solaris_1 Error for Command: write(10) Error Level: Retryable
Jan 1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice] Requested Block: 37782714 Error Block: 37782714
Jan 1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: 0344A6E4EG
Jan 1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Jan 1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice] ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x4
A failed disk just won’t respond:
Jul 19 11:21:59 solaris_1 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,700000/scsi@2/sd@1,0 (sd2):
Jul 19 11:21:59 solaris_1 disk not responding to selection
Jul 19 11:22:01 solaris_1 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,700000/scsi@2/sd@1,0 (sd2):
Jul 19 11:22:01 solaris_1 disk not responding to selection
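To tell the two kinds of log entries apart without reading the whole file, something like the sketch below may help. It is an illustration under assumptions: count_disk_messages is a made-up name, the device is identified by a fragment of its path (such as sd@1,0), and the detail line is assumed to follow its WARNING line as in the excerpts above. It uses awk -v, so on Solaris it may need nawk or /usr/xpg4/bin/awk rather than the default awk.

```shell
# Read syslog-style lines on stdin and count, for the device whose path
# contains $1, how many "Error for Command" entries and how many
# "disk not responding to selection" entries appear.
# Prints: "<command-errors> <not-responding>".
# count_disk_messages is a hypothetical helper name.
count_disk_messages() {
    awk -v dev="$1" '
        # Each WARNING line names a device; remember if it is ours.
        /WARNING:/ { hit = (index($0, dev) > 0) }
        hit && /Error for Command:/ { cmd++ }
        hit && /disk not responding to selection/ { dead++ }
        END { print cmd + 0, dead + 0 }
    '
}

# Possible use:
#   count_disk_messages 'sd@1,0' </var/adm/messages
```

A non-zero first number with a zero second number points to a failing disk; a growing second number points to a disk that has stopped responding.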
A failing disk will show an increase in the number of hard and transport errors over time.
# iostat -En c1t1d0
c1t1d0 Soft Errors: 0 Hard Errors: 28473 Transport Errors: 107662
Vendor: SEAGATE Product: ST336607LSUN36G Revision: 0307 Serial No: 0344A6E4EG
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 28473 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
#
A failed disk will only show an increase in the number of transport errors.
# iostat -En c1t1d0
c1t1d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 18
Vendor: FUJITSU Product: MAW3073NCSUN72G Revision: 1703 Serial No: 0708B0KP9L
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
#
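Because the signal here is a change over time, it can help to reduce the iostat -En output to just the counters of interest and sample them repeatedly. The sketch below is one way to do that, assuming the summary line has the layout shown above (device name first, then the Soft, Hard and Transport counters); iostat_error_counts is a made-up name.

```shell
# Read `iostat -En` output on stdin and print, for each disk,
# "<device> <hard-errors> <transport-errors>".
# iostat_error_counts is a hypothetical helper name.
iostat_error_counts() {
    awk '/Soft Errors:/ {
        hard = ""; trans = ""
        for (i = 1; i <= NF; i++) {
            # Each counter is laid out as "<Kind> Errors: <value>".
            if ($i == "Hard")      hard  = $(i + 2)
            if ($i == "Transport") trans = $(i + 2)
        }
        print $1, hard, trans
    }'
}

# Possible use: take two samples some time apart and compare them.
#   iostat -En c1t1d0 | iostat_error_counts
```

If both numbers climb between samples, the disk fits the failing pattern; if only the transport count climbs, it fits the failed pattern.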
Once you’re sure that the disk has failed, you can skip the metadevice-related operations (metadetach and metaclear) that must be done before replacing a failing disk. In fact, SVM will not let you run metadetach or metaclear against a failed disk.
# metadetach d0 d20
metadetach: solaris_1: d0: attempt an operation on a submirror that has erred components
#
You will still need to delete the state database replicas. But that’s one command compared to the dozen or so metadetaches and metaclears that you need to do for a failing disk.
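To find which slices on the bad disk hold state database replicas before deleting them, a sketch like the following could be used. It assumes that each replica line in the metadb output ends with the slice's /dev/dsk path; replica_slices_on is a made-up name, and the slice shown in the usage comment (s7) is only an example, not taken from this system. As with any awk -v use on Solaris, nawk or /usr/xpg4/bin/awk may be needed.

```shell
# Read `metadb` output on stdin and print each distinct slice on disk $1
# that holds at least one state database replica (for example c1t1d0s7).
# replica_slices_on is a hypothetical helper name.
replica_slices_on() {
    awk -v disk="$1" '
        {
            slice = $NF                      # assumed: device path is the last field
            sub("^/dev/dsk/", "", slice)     # keep the short cXtYdZsN form
            # Match the disk name followed by "s" so c1t1d0 does not match c1t10d0.
            if (index(slice, disk "s") == 1 && !(slice in seen)) {
                seen[slice] = 1
                print slice
            }
        }'
}

# Possible use:
#   metadb | replica_slices_on c1t1d0
# then delete the replicas on each slice it reports, for example:
#   metadb -d c1t1d0s7
```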
Read these articles for instructions on How To Replace A Failed SVM Disk and How To Replace A Failing SVM Disk.