== CRITICAL DISK STORAGE PROBLEM WITH MD1K-2 ==

There is a serious problem with md1k-2, one of the PowerVault MD1000s connected to s3. It will get into a state where two disks appear to have failed. A two-disk failure when using RAID-5 would mean complete data loss. Fortunately, I've found a remedy that is allowing us to copy the data to a different array.

<pre>
Dell Higher Education Support
1-800-274-7799
Enter Express Service Code 43288304365 when prompted on the call.
</pre>

{| border="1"
!host
!service tag
!contract end
!description
!notes
|-
|c8
|CPM1NF1
|02/15/2011
|PowerEdge 1950
|old schuster storage, moved PERC 5/E to s3
|-
|s3
|GHNCVH1
|06/22/2012
|PowerEdge 1950
|connected to md1k-2 via PERC 5/E
|-
|md1k-1
|FVWQLF1
|02/07/2011
|PowerVault MD1000
|enclosure 3
|-
|md1k-2
|JVWQLF1
|02/07/2011
|PowerVault MD1000
|enclosure 2
|-
|md1k-3
|4X9NLF1
|03/06/2011
|PowerVault MD1000
|enclosure 1
|}

The following things need to be done:
# Copy the data from md1k-2 to md1k-1 (see [[#Copying_data_from_md1k-2_to_md1k-1|Copying data]] below).
# Call Dell to resolve this problem.
# Adjust links in /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/archive to reflect the new data location (see the sketch after this list).
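
For the link adjustment, here is a hedged sketch of one way to find and repoint symlinks that still reference the old location. It assumes the links point into /zfs/md1k-2/archive (the actual layout may differ), and it only echoes the commands so you can preview before applying:

<pre>
# Preview every symlink under archive/ that points into the old location.
cd /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/archive
find . -type l | while read -r link; do
    target=$(readlink "$link")
    case "$target" in
        /zfs/md1k-2/archive/*)
            # Repoint to the same path on md1k-1; drop the echo to apply.
            echo ln -sfn "/zfs/md1k-1/archive${target#/zfs/md1k-2/archive}" "$link"
            ;;
    esac
done
</pre>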

It seems unlikely that so many disks would fail so close together. I'm thinking it might be a problem with the PowerVault MD1000 itself.

=== Description of problem we're having with md1k-2 ===

I've narrowed down the error by looking through the adapter event logs. There will be two errors, the second coming about 45 seconds after the first:
<pre>
Tue Mar 23 23:09:56 2010 Error on PD 31(e2/s0) (Error f0)
Tue Mar 23 23:10:43 2010 Error on PD 30(e2/s1) (Error f0)
</pre>
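
For reference, the full event log can be dumped to a file with MegaCli. This invocation is from memory, so double-check it against <code>MegaCli -h</code> if it complains:
<pre>
# Dump adapter 0's event log to a file, then grep it for the f0 errors.
MegaCli -AdpEventLog -GetEvents -f md1k-2-events.log -a0
grep 'Error f0' md1k-2-events.log
</pre>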

After that, the virtual disk goes offline.

<pre>
[root@s3: ~/storage]# MegaCli -LDInfo -L1 -a0

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 1 (Target Id: 1)
Name:md1k-2
RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
Size:8.862 TB
State: Offline
Stripe Size: 64 KB
Number Of Drives:14
Span Depth:1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None
Number of Dedicated Hot Spares: 1
0 : EnclId - 18 SlotId - 1
</pre>
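
To see the state of every drive at once (roughly what the ./check_disk_states helper used below wraps; a hedged one-liner, not the helper itself):
<pre>
# Firmware state of each physical drive on adapter 0
# (Online, Failed, Rebuild, Unconfigured(good), ...).
MegaCli -PDList -a0 | grep -E 'Slot Number|Firmware state'
</pre>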

If you go into the server room, you'll see flashing amber lights on the disks in md1k-2 slots 0 and 1. I can bring md1k-2 back using the following procedure (replace slots 0 and 1 with whichever slots are appropriate):

# Take the disks in md1k-2 slots 0 and 1 about half-way out and then push them back in.
# Wait a few seconds for the lights on md1k-2 slots 0 and 1 to return to green.
# Press the power button on s3 until it turns off.
# Press the power button on s3 again to turn it back on.
# Log into s3 as root after it has finished booting.

I then import the foreign configuration. The disk in md1k-2 slot 1 is fine, but the disk in slot 0 needs to be rebuilt. It takes about 4 hours to rebuild a drive.
<pre>
[root@s3: ~/storage]# MegaCli -CfgForeign -Scan -a0

There are 2 foreign configuration(s) on controller 0.

Exit Code: 0x00
[root@s3: ~/storage]# MegaCli -CfgForeign -Import -a0

Foreign configuration is imported on controller 0.

Exit Code: 0x00
[root@s3: ~/storage]# ./check_disk_states
PhysDrv [ 18:0 ] in md1k-2 is in Rebuild state.

command to check rebuild progress:
MegaCli -PDRbld -ShowProg -PhysDrv [ 18:0 ] -a0

command to estimate remaining rebuild time:
./time_left 18 0
[root@s3: ~/storage]# MegaCli -PDRbld -ShowProg -PhysDrv [ 18:0 ] -a0

Rebuild Progress on Device at Enclosure 18, Slot 0 Completed 56% in 139 Minutes.

Exit Code: 0x00
[root@s3: ~/storage]# ./time_left 18 0
time_left: PhysDrv [ 18:0 ] will be done rebuilding in about 1:49:12
</pre>
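
For the curious, ./time_left is just arithmetic on that -PDRbld output; a minimal sketch of the idea follows (a hypothetical reconstruction, not necessarily the script that's on s3). With the output above (56% in 139 minutes), remaining = 139 × 44/56 minutes, which comes out to the 1:49:12 shown:
<pre>
#!/bin/sh
# time_left ENCLOSURE SLOT - estimate remaining rebuild time.
# Hypothetical reconstruction of the helper used above.
encl=$1 slot=$2
line=$(MegaCli -PDRbld -ShowProg -PhysDrv [ "$encl:$slot" ] -a0 | grep Completed)
pct=$(echo "$line" | sed 's/.*Completed \([0-9]*\)%.*/\1/')
min=$(echo "$line" | sed 's/.*in \([0-9]*\) Minutes.*/\1/')
# remaining = elapsed * (100 - pct) / pct, shown as H:MM:SS
secs=$(( min * 60 * (100 - pct) / pct ))
printf 'time_left: PhysDrv [ %s:%s ] will be done rebuilding in about %d:%02d:%02d\n' \
    "$encl" "$slot" $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
</pre>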

After the disk is finished rebuilding, reboot s3 and md1k-2 will be available once again.

=== Log entries containing Error f0 ===

Extracted from the PERC 5/E event log:
{| border="1"
|Tue Mar 9 15:02:11 2010
|Error on PD 1d(e1/s4) (Error f0)
|-
|Tue Mar 9 15:02:58 2010
|Error on PD 1c(e1/s5) (Error f0)
|-
|Fri Mar 12 15:00:25 2010
|Error on PD 30(e2/s1) (Error f0)
|-
|Fri Mar 12 15:01:12 2010
|Error on PD 2f(e2/s2) (Error f0)
|-
|Wed Mar 17 03:55:03 2010
|Error on PD 30(e2/s1) (Error f0)
|-
|Wed Mar 17 03:55:50 2010
|Error on PD 2f(e2/s2) (Error f0)
|-
|Sat Mar 20 11:53:06 2010
|Error on PD 29(e2/s8) (Error f0)
|-
|Sat Mar 20 11:53:54 2010
|Error on PD 28(e2/s9) (Error f0)
|-
|Tue Mar 23 19:55:39 2010
|Error on PD 28(e2/s9) (Error f0)
|-
|Tue Mar 23 19:56:21 2010
|Error on PD 27(e2/s10) (Error f0)
|-
|Tue Mar 23 23:09:56 2010
|Error on PD 31(e2/s0) (Error f0)
|-
|Tue Mar 23 23:10:43 2010
|Error on PD 30(e2/s1) (Error f0)
|-
|Wed Mar 24 22:02:34 2010
|Error on PD 29(e2/s8) (Error f0)
|-
|Wed Mar 24 22:03:17 2010
|Error on PD 28(e2/s9) (Error f0)
|-
|Thu Mar 25 22:08:01 2010
|Error on PD 31(e2/s0) (Error f0)
|-
|Thu Mar 25 22:08:44 2010
|Error on PD 23(e2/s14) (Error f0)
|-
|Thu Mar 25 22:08:44 2010
|Error on PD 30(e2/s1) (Error f0)
|-
|Fri Mar 26 20:02:13 2010
|Error on PD 31(e2/s0) (Error f0)
|-
|Fri Mar 26 20:04:06 2010
|Error on PD 23(e2/s14) (Error f0)
|-
|Fri Mar 26 20:04:06 2010
|Error on PD 30(e2/s1) (Error f0)
|-
|Mon Mar 29 17:00:30 2010
|Error on PD 31(e2/s0) (Error f0)
|-
|Mon Mar 29 17:01:13 2010
|Error on PD 23(e2/s14) (Error f0)
|-
|Mon Mar 29 17:01:13 2010
|Error on PD 30(e2/s1) (Error f0)
|-
|Tue Mar 30 00:00:45 2010
|Error on PD 2e(e2/s3) (Error f0)
|-
|Tue Mar 30 00:01:56 2010
|Error on PD 2d(e2/s4) (Error f0)
|-
|Tue Mar 30 00:01:56 2010
|Error on PD 30(e2/s1) (Error f0)
|}

=== Copying data from md1k-2 to md1k-1 ===

The zpool md1k-2 contains the zfs filesystem md1k-2/archive, which is mounted at s3:/zfs/md1k-2/archive. That filesystem holds archives of 454 runs from May 2009 through December 2009 and all of the Illumina runs from 2009.

We're copying the data from md1k-2 to md1k-1. There are about 7TB of data on md1k-2; only 729GB has been copied to md1k-1 so far.

First, I make a few changes to facilitate the data copy (in particular, failmode=continue makes the pool return errors instead of blocking all I/O if md1k-2 fails again mid-copy):
<pre>
[root@s3: ~/storage]# zfs unshare md1k-2/data
[root@s3: ~/storage]# zfs unshare md1k-2/archive
[root@s3: ~/storage]# zfs unshare md1k-1/archive
[root@s3: ~/storage]# zpool set failmode=continue md1k-2
[root@s3: ~/storage]# zfs set checksum=off md1k-2/archive
</pre>

Then, I use rsync inside a screen session to copy the data:
<pre>
[root@s3: ~/storage]# screen
[root@s3: ~/storage]# cd /zfs/md1k-2/archive
[root@s3: ~/storage]# rsync -Pav sequencing /zfs/md1k-1/archive
</pre>

After the rsync starts copying the data, I type <code>control-a control-d</code> (control-a followed by control-d) to detach the screen session. If md1k-2 acts up and the rsync dies, re-running the same rsync command skips the files it has already copied.

To reconnect to the screen session:
<pre>
[root@s3: ~/storage]# screen -r
</pre>

When copying is finished, undo the previous changes:
<pre>
[root@s3: ~/storage]# zfs set checksum=on md1k-2/archive
[root@s3: ~/storage]# zpool set failmode=wait md1k-2
</pre>
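
Before the old pool goes away, it's worth verifying the copy. A hedged check using an rsync dry run (anything it lists would still need transferring, i.e., didn't copy cleanly):
<pre>
# -n = dry run; --checksum compares file contents, not just size/mtime.
cd /zfs/md1k-2/archive
rsync -Pavn --checksum sequencing /zfs/md1k-1/archive
</pre>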

After all of the data has been copied, you'll need to delete and re-create the md1k-2 zpool.
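
A sketch of that last step, with a placeholder device name (check <code>zpool status md1k-2</code> for the real layout before running anything like this):
<pre>
# WARNING: destroys the pool and everything on it.
# c2t1d0 is a placeholder; use the actual device backing md1k-2.
zpool destroy md1k-2
zpool create md1k-2 c2t1d0
zfs create md1k-2/archive
</pre>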

== Miscellaneous ==

The following list is in no particular order.