How To Replace A Failing SVM Disk

Before you replace what you believe is a failing Solaris Volume Manager (SVM) disk, establish whether it is still in the process of failing or has already failed. The distinction matters because replacing a disk that has already failed can take a little less time than replacing one that is still failing.

Read How To Tell The Difference Between A Failed Disk And A Failing Disk to find out which one your disk is. If your disk has indeed failed, follow How To Replace A Failed SVM Disk instead.

In this example, two disks, c1t0d0 and c1t1d0, are mirrored to each other using Solaris Volume Manager. c1t1d0 is showing signs of impending failure and has to be replaced.

# tail /var/adm/messages
Jan  1 03:11:19 solaris_1 scsi: [ID 107833 kern.warning] WARNING: /pci@1c,600000/scsi@2/sd@1,0 (sd1):
Jan  1 03:11:19 solaris_1  Error for Command: write(10)               Error Level: Retryable
Jan  1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice]    Requested Block: 37782714                  Error Block: 37782714
Jan  1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice]    Vendor: SEAGATE                            Serial Number: 0344A6E4EG
Jan  1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice]    Sense Key: Unit Attention
Jan  1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice]    ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x4
#
# iostat -En c1t1d0
c1t1d0           Soft Errors: 0 Hard Errors: 28473 Transport Errors: 107662
Vendor: SEAGATE  Product: ST336607LSUN36G  Revision: 0307 Serial No: 0344A6E4EG
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 28473 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
#
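
If you would rather scan every disk at once than check one device at a time, the error counters can be filtered out of the iostat output. This is only a minimal sketch and not part of the original procedure: disks_with_errors is a hypothetical helper name, and it assumes the first line of each iostat -En record has the same layout as the output above.

```shell
#!/bin/sh
# Hypothetical helper: reads `iostat -En` output on stdin and prints one line
# for every device whose hard or transport error counter is non-zero.
# Assumed record layout (first line of each device):
#   <device>  Soft Errors: N Hard Errors: N Transport Errors: N
disks_with_errors() {
    awk '/Soft Errors:/ && ($7 + 0 > 0 || $10 + 0 > 0) {
        printf "%s hard=%s transport=%s\n", $1, $7, $10
    }'
}
```

You would run it as iostat -En | disks_with_errors. A non-zero counter is a hint to look closer, not proof that the disk is failing.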

Find out if the failing SVM disk contains metadatabase replicas and delete them.

# metadb | grep c1t1d0
     a        u         16              8192            /dev/dsk/c1t1d0s7
     a        u         8208            8192            /dev/dsk/c1t1d0s7
     a        u         16400           8192            /dev/dsk/c1t1d0s7
#
# metadb -d c1t1d0s7
#
# metadb
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c1t0d0s7
     a    p  luo        8208            8192            /dev/dsk/c1t0d0s7
     a    p  luo        16400           8192            /dev/dsk/c1t0d0s7
#
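
On a host with replicas spread over several slices, the slices to delete can be worked out from the metadb output instead of by eye. The sketch below is an illustration under assumptions: replica_delete_cmds is a hypothetical name, it expects the device path in the last column as shown above, and it only prints the commands so you can review them before running anything. On Solaris, use nawk or /usr/xpg4/bin/awk if the default awk does not accept -v.

```shell
#!/bin/sh
# Hypothetical helper: reads `metadb` output on stdin and prints one
# `metadb -d` command per slice of the given disk that holds replicas.
# It prints commands only; nothing is deleted.
replica_delete_cmds() {
    disk=$1
    awk -v disk="$disk" '
        $NF ~ ("^/dev/dsk/" disk "s[0-9]+$") {
            slice = $NF
            sub(/^\/dev\/dsk\//, "", slice)   # metadb -d takes the short slice name
            if (!(slice in seen)) {           # several replicas can share one slice
                seen[slice] = 1
                print "metadb -d " slice
            }
        }'
}
```

For the host in this example, metadb | replica_delete_cmds c1t1d0 prints the single command used above, because all three replicas live on the same slice.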

Detach the submirrors on the failing SVM disk.

# metastat -p
d0 -m d10 d20 1
d10 1 1 c1t0d0s0
d20 1 1 c1t1d0s0
d6 -m d16 d26 1
d16 1 1 c1t0d0s6
d26 1 1 c1t1d0s6
d5 -m d15 d25 1
d15 1 1 c1t0d0s5
d25 1 1 c1t1d0s5
d4 -m d14 d24 1
d14 1 1 c1t0d0s4
d24 1 1 c1t1d0s4
d3 -m d13 d23 1
d13 1 1 c1t0d0s3
d23 1 1 c1t1d0s3
d1 -m d11 d21 1
d11 1 1 c1t0d0s1
d21 1 1 c1t1d0s1
#
# metastat -p | grep c1t1d0
d20 1 1 c1t1d0s0
d26 1 1 c1t1d0s6
d25 1 1 c1t1d0s5
d24 1 1 c1t1d0s4
d23 1 1 c1t1d0s3
d21 1 1 c1t1d0s1
#
# metadetach d0 d20
d0: submirror d20 is detached
# metadetach d6 d26
d6: submirror d26 is detached
# metadetach d5 d25
d5: submirror d25 is detached
# metadetach d4 d24
d4: submirror d24 is detached
# metadetach d3 d23
d3: submirror d23 is detached
# metadetach d1 d21
d1: submirror d21 is detached
#
# metastat -p
d0 -m d10 1
d10 1 1 c1t0d0s0
d6 -m d16 1
d16 1 1 c1t0d0s6
d5 -m d15 1
d15 1 1 c1t0d0s5
d4 -m d14 1
d14 1 1 c1t0d0s4
d3 -m d13 1
d13 1 1 c1t0d0s3
d1 -m d11 1
d11 1 1 c1t0d0s1
d20 1 1 c1t1d0s0
d26 1 1 c1t1d0s6
d25 1 1 c1t1d0s5
d24 1 1 c1t1d0s4
d23 1 1 c1t1d0s3
d21 1 1 c1t1d0s1
#
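
The metadetach commands above follow mechanically from the metastat -p listing: find each submirror that sits on the failing disk, then look up the mirror it belongs to. Here is a rough sketch of that lookup. The name detach_cmds is hypothetical, and the script assumes one-way concat/stripe submirrors and mirror lines of the form "dN -m submirrors... pass", as in this example. It prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical helper: reads `metastat -p` output on stdin and prints the
# metadetach command for every submirror built on the given disk.
detach_cmds() {
    disk=$1
    awk -v disk="$disk" '
        $2 == "-m" {
            # Mirror line: name, -m, submirror names, then the pass number.
            for (i = 3; i < NF; i++) parent[$i] = $1
            next
        }
        $NF ~ ("^" disk "s[0-9]+$") {
            # Simple concat/stripe whose component is a slice of the disk.
            sub_name[++n] = $1
        }
        END {
            # Submirrors that are already detached have no parent and are skipped.
            for (i = 1; i <= n; i++)
                if (sub_name[i] in parent)
                    print "metadetach " parent[sub_name[i]] " " sub_name[i]
        }'
}
```

Running metastat -p | detach_cmds c1t1d0 before the detach should list the same six commands shown above. Running it again afterwards should print nothing.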

Remove the detached submirrors from the SVM metadatabase.

# metaclear d20
d20: Concat/Stripe is cleared
# metaclear d26
d26: Concat/Stripe is cleared
# metaclear d25
d25: Concat/Stripe is cleared
# metaclear d24
d24: Concat/Stripe is cleared
# metaclear d23
d23: Concat/Stripe is cleared
# metaclear d21
d21: Concat/Stripe is cleared
#
# metastat -p
d0 -m d10 1
d10 1 1 c1t0d0s0
d6 -m d16 1
d16 1 1 c1t0d0s6
d5 -m d15 1
d15 1 1 c1t0d0s5
d4 -m d14 1
d14 1 1 c1t0d0s4
d3 -m d13 1
d13 1 1 c1t0d0s3
d1 -m d11 1
d11 1 1 c1t0d0s1
#

Verify that all SVM objects have been removed from the failing disk.

# metastat -p | grep c1t1d0
#
# metadb | grep c1t1d0
#
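
If you script this procedure, the two empty grep results can be turned into a single pass/fail check. The following is a small sketch, with disk_is_clear as a hypothetical name. It reads whatever SVM listing you pipe into it and succeeds only if no slice of the disk is mentioned.

```shell
#!/bin/sh
# Hypothetical helper: reads SVM listings (for example `metastat -p` and
# `metadb` output) on stdin. Returns 0 if no slice of the given disk is
# referenced, 1 otherwise.
disk_is_clear() {
    disk=$1
    if grep "${disk}s[0-9]" >/dev/null; then
        return 1    # at least one SVM object still names the disk
    fi
    return 0
}
```

A possible use is { metastat -p; metadb; } | disk_is_clear c1t1d0 && echo "no SVM objects left on c1t1d0".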

Unconfigure the failing SVM disk.

# cfgadm -al
Ap_Id                          Type         Receptacle   Occupant     Condition
c0                             scsi-bus     connected    configured   unknown
c0::dsk/c0t0d0                 CD-ROM       connected    configured   unknown
c1                             scsi-bus     connected    configured   unknown
c1::dsk/c1t0d0                 disk         connected    configured   unknown
c1::dsk/c1t1d0                 disk         connected    configured   unknown
c2                             scsi-bus     connected    unconfigured unknown
usb0/1                         unknown      empty        unconfigured ok
usb0/2                         unknown      empty        unconfigured ok
#
# cfgadm -c unconfigure c1::dsk/c1t1d0
cfgadm: Component system is busy, try again: failed to offline:
Resource              Information
------------------  -------------------------
/dev/dsk/c1t1d0s2   Device being used by VxVM
#

Note: This host uses SVM to manage internal disks and Veritas Volume Manager (VxVM) to manage SAN-attached disks. VxVM keeps track of the internal disks, even though it doesn’t actually manage them, and may not allow you to unconfigure them. To get around this restriction, you may need to forcibly unconfigure the failing SVM disk by passing the -f option to cfgadm.

# cfgadm -f -c unconfigure c1::dsk/c1t1d0
#
# cfgadm -al
Ap_Id                          Type         Receptacle   Occupant     Condition
c0                             scsi-bus     connected    configured   unknown
c0::dsk/c0t0d0                 CD-ROM       connected    configured   unknown
c1                             scsi-bus     connected    configured   unknown
c1::dsk/c1t0d0                 disk         connected    configured   unknown
c1::dsk/c1t1d0                 disk         connected    unconfigured unknown
c2                             scsi-bus     connected    unconfigured unknown
usb0/1                         unknown      empty        unconfigured ok
usb0/2                         unknown      empty        unconfigured ok
#

Verify that the failing SVM disk is marked “unconfigured” as above. Sun servers with hot-swappable disks will also have the disk’s blue “ready to remove” LED lit.
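
The same check can be made without reading the whole table. This sketch assumes the five-column cfgadm -al layout shown above (Ap_Id, Type, Receptacle, Occupant, Condition). The name occupant_state is hypothetical. On Solaris, use nawk or /usr/xpg4/bin/awk if the default awk does not accept -v.

```shell
#!/bin/sh
# Hypothetical helper: reads `cfgadm -al` output on stdin and prints the
# Occupant column (fourth field) for the given attachment point ID.
occupant_state() {
    ap_id=$1
    awk -v ap="$ap_id" '$1 == ap { print $4 }'
}
```

For example, cfgadm -al | occupant_state c1::dsk/c1t1d0 should print unconfigured at this point, and configured again once the new disk has been brought online.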

Pull the failing SVM disk out of the drive bay. You will see a message similar to this if you tail -f /var/adm/messages.

Jan  6 12:24:14 solaris_1 rmclomv: [ID 545013 kern.error] DISK @ HDD1 has been removed.

Insert the new disk. The following message will come up in /var/adm/messages.

Jan  6 12:24:50 solaris_1 rmclomv: [ID 978967 kern.error] DISK @ HDD1 has been inserted.

Configure the new disk.

# cfgadm -c configure c1::dsk/c1t1d0
#
# cfgadm -al
Ap_Id                          Type         Receptacle   Occupant     Condition
c0                             scsi-bus     connected    configured   unknown
c0::dsk/c0t0d0                 CD-ROM       connected    configured   unknown
c1                             scsi-bus     connected    configured   unknown
c1::dsk/c1t0d0                 disk         connected    configured   unknown
c1::dsk/c1t1d0                 disk         connected    configured   unknown
c2                             scsi-bus     connected    unconfigured unknown
usb0/1                         unknown      empty        unconfigured ok
usb0/2                         unknown      empty        unconfigured ok
#

Verify that the new disk has been configured as above.

Copy the volume table of contents (VTOC) from the other disk in the mirror set, c1t0d0, onto the new disk.

# prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2
fmthard:  New volume table of contents now in place.
#

If the command fails with an error similar to “/dev/rdsk/c1t1d0s2: Cannot get disk geometry”, the new disk does not have a label yet. Run format to label it, then repeat the prtvtoc and fmthard command above.

# format
Searching for disks...done

c1t1d0: configured with capacity of 136.71GB

AVAILABLE DISK SELECTIONS:
       0. c1t0d0
          /pci@780/pci@0/pci@9/scsi@0/sd@0,0
       1. c1t1d0
          /pci@780/pci@0/pci@9/scsi@0/sd@1,0
       2. c1t2d0
          /pci@780/pci@0/pci@9/scsi@0/sd@2,0
       3. c1t3d0
          /pci@780/pci@0/pci@9/scsi@0/sd@3,0
Specify disk (enter its number): 1
selecting c1t1d0
[disk formatted]
Disk not labeled.  Label it now? y

FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
format> q

#
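
Once the label and VTOC are in place, it may be worth confirming that both disks really carry the same slice layout before building anything on the new one. The sketch below compares two saved prtvtoc listings. The names vtoc_slices and same_vtoc are hypothetical, and the script assumes prtvtoc's usual layout of comment lines starting with "*" followed by one line per partition. It needs a POSIX shell such as ksh or bash.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two saved `prtvtoc` listings.

# Reads prtvtoc output on stdin. Drops comment lines (they start with "*")
# and keeps partition, tag, flags, first sector, sector count, last sector.
# The optional mount directory column is ignored.
vtoc_slices() {
    awk '$1 !~ /^\*/ && NF >= 6 { print $1, $2, $3, $4, $5, $6 }'
}

# $1 and $2 are files holding prtvtoc output. Returns 0 if the slice
# tables match, non-zero otherwise.
same_vtoc() {
    [ "$(vtoc_slices < "$1")" = "$(vtoc_slices < "$2")" ]
}
```

You could save the two listings with prtvtoc /dev/rdsk/c1t0d0s2 > /var/tmp/c1t0d0.vtoc and the equivalent for c1t1d0 (the file names are only examples), then run same_vtoc on the two files.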

Recreate the metadatabase replicas on the new disk.

# metadb -a -c 3 c1t1d0s7
#
# metadb
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c1t0d0s7
     a    p  luo        8208            8192            /dev/dsk/c1t0d0s7
     a    p  luo        16400           8192            /dev/dsk/c1t0d0s7
     a        u         16              8192            /dev/dsk/c1t1d0s7
     a        u         8208            8192            /dev/dsk/c1t1d0s7
     a        u         16400           8192            /dev/dsk/c1t1d0s7
#

Initialize the SVM submirrors on the new disk.

# metainit d21 1 1 c1t1d0s1
d21: Concat/Stripe is setup
# metainit d23 1 1 c1t1d0s3
d23: Concat/Stripe is setup
# metainit d24 1 1 c1t1d0s4
d24: Concat/Stripe is setup
# metainit d25 1 1 c1t1d0s5
d25: Concat/Stripe is setup
# metainit d26 1 1 c1t1d0s6
d26: Concat/Stripe is setup
# metainit d20 1 1 c1t1d0s0
d20: Concat/Stripe is setup
#
# metastat -p
d0 -m d10 1
d10 1 1 c1t0d0s0
d6 -m d16 1
d16 1 1 c1t0d0s6
d5 -m d15 1
d15 1 1 c1t0d0s5
d4 -m d14 1
d14 1 1 c1t0d0s4
d3 -m d13 1
d13 1 1 c1t0d0s3
d1 -m d11 1
d11 1 1 c1t0d0s1
d20 1 1 c1t1d0s0
d26 1 1 c1t1d0s6
d25 1 1 c1t1d0s5
d24 1 1 c1t1d0s4
d23 1 1 c1t1d0s3
d21 1 1 c1t1d0s1
#

Attach the new submirrors.

# metattach d1 d21
d1: submirror d21 is attached
# metattach d3 d23
d3: submirror d23 is attached
# metattach d4 d24
d4: submirror d24 is attached
# metattach d5 d25
d5: submirror d25 is attached
# metattach d6 d26
d6: submirror d26 is attached
# metattach d0 d20
d0: submirror d20 is attached
#
# metastat -p
d0 -m d10 d20 1
d10 1 1 c1t0d0s0
d20 1 1 c1t1d0s0
d6 -m d16 d26 1
d16 1 1 c1t0d0s6
d26 1 1 c1t1d0s6
d5 -m d15 d25 1
d15 1 1 c1t0d0s5
d25 1 1 c1t1d0s5
d4 -m d14 d24 1
d14 1 1 c1t0d0s4
d24 1 1 c1t1d0s4
d3 -m d13 d23 1
d13 1 1 c1t0d0s3
d23 1 1 c1t1d0s3
d1 -m d11 d21 1
d11 1 1 c1t0d0s1
d21 1 1 c1t1d0s1
#
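
The metainit and metattach commands in the last two steps simply restore the layout that metastat -p reported before anything was detached. If you saved that listing to a file beforehand, the commands can be regenerated from it. This is a sketch under that assumption: rebuild_cmds is a hypothetical name, it handles only simple one-way concat/stripe submirrors like the ones here, and it prints the commands for review instead of executing them.

```shell
#!/bin/sh
# Hypothetical helper: reads a saved pre-replacement `metastat -p` listing on
# stdin and prints the metainit commands, then the metattach commands, needed
# to rebuild every submirror that lived on the given disk.
rebuild_cmds() {
    disk=$1
    awk -v disk="$disk" '
        $2 == "-m" {
            # Mirror line: name, -m, submirror names, then the pass number.
            for (i = 3; i < NF; i++) parent[$i] = $1
            next
        }
        $NF ~ ("^" disk "s[0-9]+$") {
            name[++n] = $1
            def[n] = $0        # the line is already in metainit argument form
        }
        END {
            for (i = 1; i <= n; i++) print "metainit " def[i]
            for (i = 1; i <= n; i++)
                if (name[i] in parent)
                    print "metattach " parent[name[i]] " " name[i]
        }'
}
```

With the original listing saved in, say, /var/tmp/metastat.before (an example path), rebuild_cmds c1t1d0 < /var/tmp/metastat.before prints all the metainit lines first and the metattach lines after them, the same order as the two steps above.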

Update the new disk’s device ID entry in SVM. This step may not be required but it’s a good idea to do it just in case.

# metadevadm -u c1t1d0
Updating Solaris Volume Manager device relocation information for c1t1d0
Old device reloc information:
id1,sd@SSEAGATE_ST336607LSUN36G_3JAX5SL30000731858TJ
New device reloc information:
id1,sd@SSEAGATE_ST336607LSUN36G_3JAX5SL30000731858TJ
#

SVM starts resyncing the submirrors on the new disk as soon as they are attached. The resync runs in the background and may take a fair amount of time, depending on the size of the submirrors. Now is a good time to go for a cup of coffee. Don’t forget to check the progress of the resync with metastat when you return.
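
To check on the resync without reading the full metastat output, the progress lines can be pulled out on their own. This sketch assumes metastat prints a header such as "d0: Mirror" for each mirror and a line of the form "Resync in progress: 15 % done" while a resync is running. Check that against the output on your release before relying on it. The name resync_progress is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `metastat` output on stdin and prints
# "<mirror> <percent>%" for each mirror that reports a resync in progress.
resync_progress() {
    awk '
        /^d[0-9]+: Mirror/ {
            mirror = $1
            sub(/:$/, "", mirror)    # "d0:" -> "d0"
        }
        /Resync in progress:/ { print mirror, $4 "%" }
    '
}
```

Run it as metastat | resync_progress. No output means that no mirror currently reports a resync, which is what you should see once all six mirrors are back in sync.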