How To Replace A Failing SVM Disk
Before you replace what you believe is a failing Solaris Volume Manager (SVM) disk, establish whether it is still in the process of failing or has already failed. Why does the distinction matter? Replacing a failed SVM disk can take a little less time than replacing a failing one.
Read How To Tell The Difference Between A Failed Disk And A Failing Disk to find out which one your disk is. If your disk indeed has failed, this article will show you How To Replace A Failed SVM Disk.
In this example, two disks, c1t0d0 and c1t1d0, are mirrored to each other using Solaris Volume Manager. c1t1d0 is showing signs of impending failure and has to be replaced.
# tail /var/adm/messages
Jan 1 03:11:19 solaris_1 scsi: [ID 107833 kern.warning] WARNING: /pci@1c,600000/scsi@2/sd@1,0 (sd1):
Jan 1 03:11:19 solaris_1 Error for Command: write(10) Error Level: Retryable
Jan 1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice] Requested Block: 37782714 Error Block: 37782714
Jan 1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: 0344A6E4EG
Jan 1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Jan 1 03:11:19 solaris_1 scsi: [ID 107833 kern.notice] ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x4
#
# iostat -En c1t1d0
c1t1d0 Soft Errors: 0 Hard Errors: 28473 Transport Errors: 107662
Vendor: SEAGATE Product: ST336607LSUN36G Revision: 0307 Serial No: 0344A6E4EG
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 28473 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
#
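When several disks need checking, the error counters can be pulled out of the iostat output with a small filter. The following is only a sketch: the function name disks_with_errors is made up for this example, and it assumes the first line of each device entry looks like the one shown above.

```shell
# Hypothetical helper (not part of Solaris): reads `iostat -En` output on
# stdin and prints "device hard_errors transport_errors" for every disk
# whose hard or transport error count is non-zero.
disks_with_errors() {
    awk '/Soft Errors:/ {
        hard = 0; transport = 0
        # Walk the fields looking for the "Hard Errors:" and
        # "Transport Errors:" labels; the count follows each label.
        for (i = 1; i < NF; i++) {
            if ($i == "Hard" && $(i+1) == "Errors:")      hard = $(i+2)
            if ($i == "Transport" && $(i+1) == "Errors:") transport = $(i+2)
        }
        if (hard + 0 > 0 || transport + 0 > 0) print $1, hard, transport
    }'
}

# Intended usage on the host:
#   iostat -En | disks_with_errors
```

A disk that shows up in this list is not necessarily dead, so the counters are best read alongside /var/adm/messages.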
Find out if the failing SVM disk contains metadatabase replicas and delete them.
# metadb | grep c1t1d0
a u 16 8192 /dev/dsk/c1t1d0s7
a u 8208 8192 /dev/dsk/c1t1d0s7
a u 16400 8192 /dev/dsk/c1t1d0s7
#
# metadb -d c1t1d0s7
#
# metadb
flags first blk block count
a m p luo 16 8192 /dev/dsk/c1t0d0s7
a p luo 8208 8192 /dev/dsk/c1t0d0s7
a p luo 16400 8192 /dev/dsk/c1t0d0s7
#
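Before and after deleting replicas, it can help to see how many replicas each slice holds, so you can confirm that the surviving disk still carries its copies. This is a minimal sketch; replicas_per_slice is a hypothetical name, and it assumes the device path is the last column of the metadb output, as in the listings above.

```shell
# Hypothetical helper: reads `metadb` output on stdin and prints
# "slice replica_count" for each slice, sorted by slice name.
replicas_per_slice() {
    # Only lines whose last field is a /dev/dsk path are replica entries;
    # the "flags first blk block count" header is skipped automatically.
    awk '$NF ~ /^\/dev\/dsk\// { count[$NF]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Intended usage on the host:
#   metadb | replicas_per_slice
```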
Detach the submirrors in the failing SVM disk.
# metastat -p
d0 -m d10 d20 1
d10 1 1 c1t0d0s0
d20 1 1 c1t1d0s0
d6 -m d16 d26 1
d16 1 1 c1t0d0s6
d26 1 1 c1t1d0s6
d5 -m d15 d25 1
d15 1 1 c1t0d0s5
d25 1 1 c1t1d0s5
d4 -m d14 d24 1
d14 1 1 c1t0d0s4
d24 1 1 c1t1d0s4
d3 -m d13 d23 1
d13 1 1 c1t0d0s3
d23 1 1 c1t1d0s3
d1 -m d11 d21 1
d11 1 1 c1t0d0s1
d21 1 1 c1t1d0s1
#
# metastat -p | grep c1t1d0
d20 1 1 c1t1d0s0
d26 1 1 c1t1d0s6
d25 1 1 c1t1d0s5
d24 1 1 c1t1d0s4
d23 1 1 c1t1d0s3
d21 1 1 c1t1d0s1
#
# metadetach d0 d20
d0: submirror d20 is detached
# metadetach d6 d26
d6: submirror d26 is detached
# metadetach d5 d25
d5: submirror d25 is detached
# metadetach d4 d24
d4: submirror d24 is detached
# metadetach d3 d23
d3: submirror d23 is detached
# metadetach d1 d21
d1: submirror d21 is detached
#
# metastat -p
d0 -m d10 1
d10 1 1 c1t0d0s0
d6 -m d16 1
d16 1 1 c1t0d0s6
d5 -m d15 1
d15 1 1 c1t0d0s5
d4 -m d14 1
d14 1 1 c1t0d0s4
d3 -m d13 1
d13 1 1 c1t0d0s3
d1 -m d11 1
d11 1 1 c1t0d0s1
d20 1 1 c1t1d0s0
d26 1 1 c1t1d0s6
d25 1 1 c1t1d0s5
d24 1 1 c1t1d0s4
d23 1 1 c1t1d0s3
d21 1 1 c1t1d0s1
#
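With many submirrors, pairing each one with its parent mirror by eye is error-prone. The sketch below derives the metadetach commands from metastat -p output and prints them for review instead of running them. The name detach_commands is hypothetical, and the parsing assumes simple one-slice submirrors laid out as in this example. On Solaris, use nawk or /usr/xpg4/bin/awk if the default awk rejects -v.

```shell
# Hypothetical helper: reads `metastat -p` output on stdin and prints one
# "metadetach <mirror> <submirror>" line for every submirror that lives on
# the disk given as $1 (for example c1t1d0). Nothing is executed.
detach_commands() {
    awk -v disk="$1" '
        # Mirror definition, e.g. "d0 -m d10 d20 1": remember the parent of
        # each submirror. The trailing pass number does not match dNN.
        $2 == "-m" {
            for (i = 3; i <= NF; i++)
                if ($i ~ /^d[0-9]+$/) parent[$i] = $1
            next
        }
        # Concat/stripe whose component starts with the disk name.
        index($NF, disk) == 1 { subs[++n] = $1 }
        END {
            for (k = 1; k <= n; k++)
                if (subs[k] in parent)
                    print "metadetach", parent[subs[k]], subs[k]
        }'
}

# Intended usage on the host:
#   metastat -p | detach_commands c1t1d0
```

Printing the commands rather than executing them leaves a chance to check each pair against the metastat listing before anything is detached.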
Remove the detached submirrors from the SVM metadatabase.
# metaclear d20
d20: Concat/Stripe is cleared
# metaclear d26
d26: Concat/Stripe is cleared
# metaclear d25
d25: Concat/Stripe is cleared
# metaclear d24
d24: Concat/Stripe is cleared
# metaclear d23
d23: Concat/Stripe is cleared
# metaclear d21
d21: Concat/Stripe is cleared
#
# metastat -p
d0 -m d10 1
d10 1 1 c1t0d0s0
d6 -m d16 1
d16 1 1 c1t0d0s6
d5 -m d15 1
d15 1 1 c1t0d0s5
d4 -m d14 1
d14 1 1 c1t0d0s4
d3 -m d13 1
d13 1 1 c1t0d0s3
d1 -m d11 1
d11 1 1 c1t0d0s1
#
Verify that all SVM objects have been removed from the failing disk.
# metastat -p | grep c1t1d0
#
# metadb | grep c1t1d0
#
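The two empty grep results above can be folded into a single yes/no check. This is a small sketch with a made-up function name, no_references_to; it does nothing more than test whether the disk name appears anywhere in the text it is given.

```shell
# Hypothetical helper: returns 0 (success) when stdin contains no
# reference to the disk named in $1, and 1 otherwise.
no_references_to() {
    # grep reads stdin; output is discarded because only the status matters.
    if grep "$1" >/dev/null; then
        return 1
    fi
    return 0
}

# Intended usage on the host:
#   { metastat -p; metadb; } | no_references_to c1t1d0 && echo "no SVM objects left on c1t1d0"
```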
Unconfigure the failing SVM disk.
# cfgadm -al
Ap_Id            Type       Receptacle   Occupant       Condition
c0               scsi-bus   connected    configured     unknown
c0::dsk/c0t0d0   CD-ROM     connected    configured     unknown
c1               scsi-bus   connected    configured     unknown
c1::dsk/c1t0d0   disk       connected    configured     unknown
c1::dsk/c1t1d0   disk       connected    configured     unknown
c2               scsi-bus   connected    unconfigured   unknown
usb0/1           unknown    empty        unconfigured   ok
usb0/2           unknown    empty        unconfigured   ok
#
# cfgadm -c unconfigure c1::dsk/c1t1d0
cfgadm: Component system is busy, try again: failed to offline:
Resource            Information
------------------  -------------------------
/dev/dsk/c1t1d0s2   Device being used by VxVM
#
Note: This host uses SVM to manage internal disks and Veritas Volume Manager (VxVM) to manage SAN attached disks. VxVM keeps track of the internal disks – even if it doesn’t actually manage them – and may not allow you to unconfigure them. To get around this restriction, you may need to forcibly unconfigure the failing SVM disk by specifying the -f parameter to cfgadm.
# cfgadm -f -c unconfigure c1::dsk/c1t1d0
#
# cfgadm -al
Ap_Id            Type       Receptacle   Occupant       Condition
c0               scsi-bus   connected    configured     unknown
c0::dsk/c0t0d0   CD-ROM     connected    configured     unknown
c1               scsi-bus   connected    configured     unknown
c1::dsk/c1t0d0   disk       connected    configured     unknown
c1::dsk/c1t1d0   disk       connected    unconfigured   unknown
c2               scsi-bus   connected    unconfigured   unknown
usb0/1           unknown    empty        unconfigured   ok
usb0/2           unknown    empty        unconfigured   ok
#
Verify that the failing SVM disk is marked “unconfigured” as above. Sun servers with hot-swappable disks will also have the disk’s blue “ready to remove” LED lit.
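If you would rather not scan the cfgadm table by eye, the Occupant column for a single attachment point can be extracted with awk. The function name occupant_state is an assumption made for this sketch, and it relies on the five-column layout shown above. On Solaris, use nawk or /usr/xpg4/bin/awk if the default awk rejects -v.

```shell
# Hypothetical helper: reads `cfgadm -al` output on stdin and prints the
# Occupant column (configured / unconfigured) for the Ap_Id given in $1.
occupant_state() {
    awk -v ap="$1" '$1 == ap { print $4 }'
}

# Intended usage on the host:
#   cfgadm -al | occupant_state c1::dsk/c1t1d0
```

The same check applies after the new disk is configured, where the expected value is "configured".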
Pull the failing SVM disk out of the drive bay. You will see a message similar to this if you tail -f /var/adm/messages.
Jan 6 12:24:14 solaris_1 rmclomv: [ID 545013 kern.error] DISK @ HDD1 has been removed.
Insert the new disk. The following message will come up in /var/adm/messages.
Jan 6 12:24:50 solaris_1 rmclomv: [ID 978967 kern.error] DISK @ HDD1 has been inserted.
Configure the new disk.
# cfgadm -c configure c1::dsk/c1t1d0
#
# cfgadm -al
Ap_Id            Type       Receptacle   Occupant       Condition
c0               scsi-bus   connected    configured     unknown
c0::dsk/c0t0d0   CD-ROM     connected    configured     unknown
c1               scsi-bus   connected    configured     unknown
c1::dsk/c1t0d0   disk       connected    configured     unknown
c1::dsk/c1t1d0   disk       connected    configured     unknown
c2               scsi-bus   connected    unconfigured   unknown
usb0/1           unknown    empty        unconfigured   ok
usb0/2           unknown    empty        unconfigured   ok
#
Verify that the new disk has been configured as above.
Copy the volume table of contents (VTOC) from the other disk in the mirror set, c1t0d0, onto the new disk.
# prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2
fmthard: New volume table of contents now in place.
#
If the command fails with an error similar to “/dev/rdsk/c1t1d0s2: Cannot get disk geometry”, the new disk has no label yet. Run format to label it, then repeat the prtvtoc | fmthard command.
# format
Searching for disks...done
c1t1d0: configured with capacity of 136.71GB
AVAILABLE DISK SELECTIONS:
0. c1t0d0
/pci@780/pci@0/pci@9/scsi@0/sd@0,0
1. c1t1d0
/pci@780/pci@0/pci@9/scsi@0/sd@1,0
2. c1t2d0
/pci@780/pci@0/pci@9/scsi@0/sd@2,0
3. c1t3d0
/pci@780/pci@0/pci@9/scsi@0/sd@3,0
Specify disk (enter its number): 1
selecting c1t1d0
[disk formatted]
Disk not labeled. Label it now? y
FORMAT MENU:
disk - select a disk
type - select (define) a disk type
partition - select (define) a partition table
current - describe the current disk
format - format and analyze the disk
repair - repair a defective sector
label - write label to the disk
analyze - surface analysis
defect - defect list management
backup - search for backup labels
verify - read and display labels
save - save new disk/partition definitions
inquiry - show vendor, product and revision
volname - set 8-character volume name
!<cmd> - execute <cmd>, then return
quit
format> q
#
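Once the VTOC has been copied, the slice layout of the two disks should match. One way to confirm this is to reduce each prtvtoc listing to its slice geometry and compare the results. This is a sketch only: vtoc_slices is a hypothetical name, and it assumes the usual prtvtoc layout in which comment lines start with an asterisk and each slice line carries partition, tag, flags, first sector, sector count and last sector.

```shell
# Hypothetical helper: reads `prtvtoc` output on stdin and prints
# "slice first_sector sector_count" for each slice, skipping the
# comment lines that begin with "*".
vtoc_slices() {
    awk '$1 !~ /^\*/ && NF >= 6 { print $1, $4, $5 }'
}

# Intended usage on the host:
#   a=$(prtvtoc /dev/rdsk/c1t0d0s2 | vtoc_slices)
#   b=$(prtvtoc /dev/rdsk/c1t1d0s2 | vtoc_slices)
#   [ "$a" = "$b" ] && echo "partition tables match"
```

The mount-point column is left out on purpose, since only the source disk's slices are likely to show mount points.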
Recreate the metadatabase replicas on the new disk.
# metadb -a -c 3 c1t1d0s7
#
# metadb
flags first blk block count
a m p luo 16 8192 /dev/dsk/c1t0d0s7
a p luo 8208 8192 /dev/dsk/c1t0d0s7
a p luo 16400 8192 /dev/dsk/c1t0d0s7
a u 16 8192 /dev/dsk/c1t1d0s7
a u 8208 8192 /dev/dsk/c1t1d0s7
a u 16400 8192 /dev/dsk/c1t1d0s7
#
Initialize the SVM submirrors on the new disk.
# metainit d21 1 1 c1t1d0s1
d21: Concat/Stripe is setup
# metainit d23 1 1 c1t1d0s3
d23: Concat/Stripe is setup
# metainit d24 1 1 c1t1d0s4
d24: Concat/Stripe is setup
# metainit d25 1 1 c1t1d0s5
d25: Concat/Stripe is setup
# metainit d26 1 1 c1t1d0s6
d26: Concat/Stripe is setup
# metainit d20 1 1 c1t1d0s0
d20: Concat/Stripe is setup
#
# metastat -p
d0 -m d10 1
d10 1 1 c1t0d0s0
d6 -m d16 1
d16 1 1 c1t0d0s6
d5 -m d15 1
d15 1 1 c1t0d0s5
d4 -m d14 1
d14 1 1 c1t0d0s4
d3 -m d13 1
d13 1 1 c1t0d0s3
d1 -m d11 1
d11 1 1 c1t0d0s1
d20 1 1 c1t1d0s0
d26 1 1 c1t1d0s6
d25 1 1 c1t1d0s5
d24 1 1 c1t1d0s4
d23 1 1 c1t1d0s3
d21 1 1 c1t1d0s1
#
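The metainit arguments used here are the same as the submirror lines that metastat -p printed before the old submirrors were cleared. If you saved that listing, the commands can be generated from it rather than typed by hand. A minimal sketch, assuming a saved copy of the earlier metastat -p output; metainit_commands is a made-up name, and the commands are printed for review, not run. On Solaris, use nawk or /usr/xpg4/bin/awk if the default awk rejects -v.

```shell
# Hypothetical helper: reads saved `metastat -p` lines on stdin and prints
# a "metainit ..." command for every concat/stripe whose component is on
# the disk given in $1.
metainit_commands() {
    awk -v disk="$1" 'index($NF, disk) == 1 { print "metainit", $0 }'
}

# Intended usage, with /var/tmp/metastat.before being a file you saved
# yourself before detaching (the path is only an example):
#   metainit_commands c1t1d0 < /var/tmp/metastat.before
```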
Attach the new submirrors.
# metattach d1 d21
d1: submirror d21 is attached
# metattach d3 d23
d3: submirror d23 is attached
# metattach d4 d24
d4: submirror d24 is attached
# metattach d5 d25
d5: submirror d25 is attached
# metattach d6 d26
d6: submirror d26 is attached
# metattach d0 d20
d0: submirror d20 is attached
#
# metastat -p
d0 -m d10 d20 1
d10 1 1 c1t0d0s0
d20 1 1 c1t1d0s0
d6 -m d16 d26 1
d16 1 1 c1t0d0s6
d26 1 1 c1t1d0s6
d5 -m d15 d25 1
d15 1 1 c1t0d0s5
d25 1 1 c1t1d0s5
d4 -m d14 d24 1
d14 1 1 c1t0d0s4
d24 1 1 c1t1d0s4
d3 -m d13 d23 1
d13 1 1 c1t0d0s3
d23 1 1 c1t1d0s3
d1 -m d11 d21 1
d11 1 1 c1t0d0s1
d21 1 1 c1t1d0s1
#
Update the new disk’s device ID entry in SVM. This step may not be required but it’s a good idea to do it just in case.
# metadevadm -u c1t1d0
Updating Solaris Volume Manager device relocation information for c1t1d0
Old device reloc information:
id1,sd@SSEAGATE_ST336607LSUN36G_3JAX5SL30000731858TJ
New device reloc information:
id1,sd@SSEAGATE_ST336607LSUN36G_3JAX5SL30000731858TJ
#
SVM will resync the submirrors in the new disk as soon as they are attached. This is done in the background and may take a fair amount of time depending on the size of the submirrors. Now is a good time to go for a cup of coffee. Don’t forget to check the progress of the resync when you return.
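To check on the resync, look at the metastat output for each mirror. The sketch below condenses it to one line per resyncing mirror. It assumes that metastat prints a header such as "d0: Mirror" for each mirror and a line of the form "Resync in progress: 15 % done" while a resync is running; verify this against your own metastat output before relying on it. The name resync_progress is hypothetical.

```shell
# Hypothetical helper: reads `metastat` output on stdin and prints
# "mirror percent%" for each mirror that reports a resync in progress.
resync_progress() {
    awk '
        # Remember the most recent mirror header, e.g. "d0: Mirror".
        /^d[0-9]+: Mirror/ { sub(":", "", $1); mirror = $1 }
        # Assumed line format: "Resync in progress: 15 % done".
        /Resync in progress:/ { print mirror, $4 "%" }'
}

# Intended usage on the host:
#   metastat | resync_progress
```

No output means no mirror reported a resync in the text that was read, which is what you would expect once all submirrors are back in sync.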
Hi ,
you did a fantastic job by posting all this stuff. It would be so helpful for guys like me who are newbies to Solaris… keep it up man. Thanks for your efforts.
Happy to help, Nagendra. This site is a work in progress, though I haven’t had a chance to devote time to it lately, so feel free to come back for more new stuff.
U r Real brother dude!!!!!!!!!
bravo
thx,very useful for me
HI
Yes it is really a fantastic job . You made a great effort in publishing this. This is very clear and easy to understand.
Thanks for you effort . Looking forward for the articles like this.
Thanks Ariel,
You are the best. Your job is very understanding and clear.You are really Solaris system admin.
Hey Ariel this is very good post and it saves my time understanding too.
Thank you,
its really help to me…good job dude…thanking u very much…
This is really good job well done.u just made us a checklist.we need more of these,thanks.
Its really an amazing info and well done bro!!!!
I want to highlight a key point: one week after I replaced the HDD, I found that the server had rebooted. This was because I had not run “# metadevadm -u c1t1d0”.
Thanks bro for once again!!
Looking forward for many articles!