How To Tell The Difference Between A Failed Disk And A Failing Disk

Why is it important to distinguish between a failed disk and one that is still in the process of failing? Because if the disk has already failed, you can skip several steps when it’s time to replace it.

In this example, two disks, c1t0d0 and c1t1d0, are mirrored to each other using Solaris Volume Manager (SVM). c1t1d0 is either showing signs of impending failure or has already failed. The sample output below was collected from more than one system, so device paths and drive models vary between listings. Here are the differences.

A failing disk is still listed with its drive type in the output of the format command.

AVAILABLE DISK SELECTIONS:
       0. c1t0d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
          /pci@780/pci@0/pci@9/scsi@0/sd@0,0
       1. c1t1d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
          /pci@780/pci@0/pci@9/scsi@0/sd@1,0
       2. c1t2d0 <SEAGATE-ST973402SSUN72G-0400-68.37GB>
          /pci@780/pci@0/pci@9/scsi@0/sd@2,0
       3. c1t3d0 <SEAGATE-ST973402SSUN72G-0400-68.37GB>
          /pci@780/pci@0/pci@9/scsi@0/sd@3,0
Specify disk (enter its number):

A failed disk is marked “drive not available”.

AVAILABLE DISK SELECTIONS:
       0. c1t0d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@1f,700000/scsi@2/sd@0,0
       1. c1t1d0 <drive not available>
          /pci@1f,700000/scsi@2/sd@1,0
       2. c1t2d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@1f,700000/scsi@2/sd@2,0
       3. c1t3d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@1f,700000/scsi@2/sd@3,0
Specify disk (enter its number):
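
The check above can be scripted. The following is a minimal sketch, not a definitive tool: the function name unavailable_disks is hypothetical, and it assumes the listing has the same layout as the samples shown here. Piping an empty line into format so that it prints the disk list and exits is a common idiom, but treat it as an assumption and test it on your own release first.

```shell
# Print the name of every disk that format reports as "drive not available".
# Reads a format disk listing on standard input.
unavailable_disks() {
    # A listing line looks like "1. c1t1d0 <drive not available>",
    # so the disk name is the second whitespace-separated field.
    awk '/<drive not available/ { print $2 }'
}

# Assumed usage on a live system:
#   echo | format 2>/dev/null | unavailable_disks
```

An empty result means that format can still see every disk, which is consistent with a failing disk rather than a failed one.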

A failing disk will show read or write errors in /var/adm/messages.

Jan  1 03:11:19 solaris_1 scsi: [ID 107833 kern.warning] WARNING: /pci@1c,600000/scsi@2/sd@1,0 (sd1):
Jan  1 03:11:19 solaris_1  Error for Command: write(10)               Error Level: Retryable
Jan  1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice]    Requested Block: 37782714                  Error Block: 37782714
Jan  1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice]    Vendor: SEAGATE                            Serial Number: 0344A6E4EG
Jan  1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice]    Sense Key: Unit Attention
Jan  1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice]    ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x4

A failed disk just won’t respond:

Jul 19 11:21:59 solaris_1 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,700000/scsi@2/sd@1,0 (sd2):
Jul 19 11:21:59 solaris_1   disk not responding to selection
Jul 19 11:22:01 solaris_1 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,700000/scsi@2/sd@1,0 (sd2):
Jul 19 11:22:01 solaris_1   disk not responding to selection
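
The two log patterns can be told apart with a short filter. This is a rough sketch under stated assumptions: the function name classify_disk_log and its output strings are made up for illustration, and it only recognizes the two message forms shown in the samples above. On Solaris, the awk in /usr/bin may not accept -v, in which case nawk or /usr/xpg4/bin/awk would be the substitute.

```shell
# Classify a disk from the kernel messages logged for its device path.
# $1 = device path as it appears in the log, e.g. /pci@1f,700000/scsi@2/sd@1,0
# $2 = log file to scan (defaults to /var/adm/messages)
classify_disk_log() {
    awk -v path="$1" '
        # A WARNING line naming our path means the next line holds the detail.
        index($0, "WARNING: " path " ") { watch = 1; next }
        watch && /disk not responding to selection/ { dead++ }
        watch && /Error for Command/ { errs++ }
        { watch = 0 }
        END {
            if (dead) print "failed"
            else if (errs) print "failing"
            else print "no errors logged"
        }
    ' "${2:-/var/adm/messages}"
}

# Assumed usage:
#   classify_disk_log '/pci@1f,700000/scsi@2/sd@1,0'
```

If both kinds of message are present, the sketch reports "failed", on the assumption that the disk logged errors first and stopped responding later.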

A failing disk will show an increase in the number of hard and transport errors over time.

# iostat -En c1t1d0
c1t1d0           Soft Errors: 0 Hard Errors: 28473 Transport Errors: 107662
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0307 Serial No: 0344A6E4EG
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 28473 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
#

A failed disk will show an increase only in the number of transport errors.

# iostat -En c1t1d0
c1t1d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 18
Vendor: FUJITSU  Product: MAW3073NCSUN72G  Revision: 1703 Serial No: 0708B0KP9L
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
#
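
Because the distinction depends on how the counters change over time, it helps to take two samples and compare them. The sketch below is one way to do that. The function names error_counts and compare_samples are hypothetical, the ten-minute interval is arbitrary, and the wording of the verdicts simply restates the rule of thumb described here.

```shell
# Print "<hard> <transport>" error counts from `iostat -En` output on stdin.
error_counts() {
    awk '/Hard Errors:/ {
        for (i = 1; i < NF; i++) {
            if ($i == "Hard" && $(i + 1) == "Errors:") hard = $(i + 2)
            if ($i == "Transport" && $(i + 1) == "Errors:") trans = $(i + 2)
        }
        print hard, trans
    }'
}

# Compare two samples: $1 $2 = hard and transport counts before,
# $3 $4 = hard and transport counts after.
compare_samples() {
    if [ "$3" -gt "$1" ] && [ "$4" -gt "$2" ]; then
        echo "hard and transport errors rising: looks failing"
    elif [ "$3" -eq "$1" ] && [ "$4" -gt "$2" ]; then
        echo "only transport errors rising: looks failed"
    else
        echo "inconclusive"
    fi
}

# Assumed usage:
#   before=$(iostat -En c1t1d0 | error_counts)
#   sleep 600
#   after=$(iostat -En c1t1d0 | error_counts)
#   compare_samples $before $after   # unquoted on purpose: each sample is two numbers
```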

Now that you’re sure that the disk has failed, you can skip the metadevice-related operations (metadetach and metaclear) that must be done prior to replacing a failing disk. In fact, SVM will not allow you to run metadetach and metaclear on a failed disk.

# metadetach d0 d20
metadetach: solaris_1: d0: attempt an operation on a submirror that has erred components
#

You will still need to delete the state database replicas on the failed disk. But that’s one command, compared to the dozen or so metadetach and metaclear commands that you need to run for a failing disk.
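
As an illustration of that one command, the sketch below finds the slices on a disk that hold replicas, so that you know what to pass to metadb -d. The helper name replica_slices is hypothetical, the slice c1t1d0s7 in the usage comment is only an assumed example, and the filter assumes that each replica line of metadb output ends with a /dev/dsk/... path.

```shell
# List the slices on one disk that hold state database replicas.
# Reads `metadb` output on stdin; $1 = disk name such as c1t1d0.
replica_slices() {
    awk -v disk="$1" '
        # The last field of a replica line is the slice path; keep its basename.
        { n = split($NF, parts, "/"); slice = parts[n] }
        # Print each slice of the requested disk once.
        index(slice, disk "s") == 1 && !seen[slice]++ { print slice }
    '
}

# Assumed usage (slice name is an example only):
#   metadb | replica_slices c1t1d0
#   metadb -d c1t1d0s7
```

Check the output of metadb again afterwards to confirm that the remaining replicas are all on healthy disks.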

For step-by-step instructions, read the articles How To Replace A Failed SVM Disk and How To Replace A Failing SVM Disk.