SLab:Md1k-2 disk problems

Dell support info

Dell Higher Education Support
1-800-274-7799
Enter Express Service Code 43288304365 when prompted on the call.
host     service tag  contract end  description        notes
c8       CPM1NF1      02/15/2011    PowerEdge 1950     old schuster storage, moved PERC 5/E to s3
s3       GHNCVH1      06/22/2012    PowerEdge 1950     connected to md1k-2 via PERC 5/E
md1k-1   FVWQLF1      02/07/2011    PowerVault MD1000  enclosure 3
md1k-2   JVWQLF1      02/07/2011    PowerVault MD1000  enclosure 2
md1k-3   4X9NLF1      03/06/2011    PowerVault MD1000  enclosure 1

New disk problems

Starting May 20, 2010, the following problems were detected after the arrays had been returned to an error-free state:

  • md1k-1, slot 3 - PERC marked the disk as failed. Reinserting the disk and importing the foreign config brought it back online. (May 20)
  • md1k-2, slot 11 went missing; md1k-2, slot 12 firmware state is "FAILED" (see the firmware-state check after this list). (May 22)
  • md1k-2, slot 3 - "Error f0" (May 25)
  • md1k-2, slot 3 - "Error f0" (May 26, morning)
  • md1k-2, slot 9 - "Error f0" while slot 14 (hot spare) was rebuilding (May 26, mid-afternoon)
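
To see which slots the controller currently considers failed or missing, check the firmware state of every physical disk. A minimal sketch, assuming adapter 0:

[root@s3: ~/storage]# MegaCli -PDList -a0 | egrep -i 'enclosure device id|slot number|firmware state'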

Old disk problems

The following things need to be done:

  1. Copy the data from md1k-2 to md1k-1 (see Copying data below).
  2. Call Dell to resolve this problem.
  3. Adjust links in /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/archive to reflect the new data location (see the sketch after this list).
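
For step 3, one way to find the links that still point at the old location (a sketch; it assumes the link targets contain "md1k-2" and is run on a host with /afs mounted):

find /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/archive -type l -lname '*md1k-2*'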

It seems unlikely that so many disks would fail so close together; I suspect the problem is with the PowerVault MD1000 enclosure itself rather than with the individual disks.

Description of the problem we're having with md1k-2

I've narrowed down the error by looking through the adapter event logs. There are always two errors, the second following about 45 seconds after the first:

Tue Mar 23 23:09:56 2010   Error on PD 31(e2/s0) (Error f0)
Tue Mar 23 23:10:43 2010   Error on PD 30(e2/s1) (Error f0)

After that, the virtual disk goes offline.
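
For reference, the adapter event log can be dumped to a file and searched for these errors. A sketch, assuming adapter 0 and an arbitrary output filename (the raw log format differs slightly from the excerpts shown on this page):

[root@s3: ~/storage]# MegaCli -AdpEventLog -GetEvents -f perc_events.log -a0
[root@s3: ~/storage]# grep 'Error f0' perc_events.log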

[root@s3: ~/storage]# MegaCli -LDInfo -L1 -a0

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 1 (Target Id: 1)
Name:md1k-2
RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
Size:8.862 TB
State: Offline
Stripe Size: 64 KB
Number Of Drives:14
Span Depth:1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None
Number of Dedicated Hot Spares: 1
    0 : EnclId - 18 SlotId - 1 
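
A quick way to check the state of every virtual disk on the controller (a sketch; the grep pattern matches the field names in the output above):

[root@s3: ~/storage]# MegaCli -LDInfo -Lall -aAll | egrep '^Virtual Disk|^State'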

If you go into the server room, you'll see flashing amber lights on the disks in md1k-2 slots 0 and 1. I can bring md1k-2 back using the following procedure (replace slots 0 and 1 with whichever slots are appropriate):

  1. Take the disks in md1k-2 slots 0 and 1 about half-way out and then push them back in.
  2. Wait a few seconds for the lights on md1k-2 slots 0 and 1 to return to green.
  3. Press the power button on s3 until it turns off.
  4. Press the power button on s3 again to turn it back on.
  5. Log into s3 as root after it has finished booting.

I then import the foreign configuration. The disk in md1k-2 slot 1 is fine, but the disk in slot 0 needs to be rebuilt. It takes about 4 hours to rebuild a drive.

[root@s3: ~/storage]# MegaCli -CfgForeign -Scan -a0
                                     
There are 2 foreign configuration(s) on controller 0.

Exit Code: 0x00
[root@s3: ~/storage]# MegaCli -CfgForeign -Import -a0
                                     
Foreign configuration is imported on controller 0.

Exit Code: 0x00
[root@s3: ~/storage]# ./check_disk_states 
PhysDrv [ 18:0 ] in md1k-2 is in Rebuild state.

command to check rebuild progress:
MegaCli -PDRbld -ShowProg -PhysDrv [ 18:0 ] -a0

command to estimate remaining rebuild time:
./time_left 18 0
[root@s3: ~/storage]# MegaCli -PDRbld -ShowProg -PhysDrv [ 18:0 ] -a0
                                     
Rebuild Progress on Device at Enclosure 18, Slot 0 Completed 56% in 139 Minutes.

Exit Code: 0x00
[root@s3: ~/storage]# ./time_left 18 0
time_left: PhysDrv [ 18:0 ] will be done rebuilding in about 1:49:12
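
time_left is a local helper script whose source isn't shown here. A minimal sketch of the arithmetic it appears to perform, assuming the "Completed NN% in MMM Minutes" progress line shown above (the real script may differ):

#!/bin/sh
# time_left ENCL SLOT - rough estimate of remaining rebuild time (sketch only)
encl=$1; slot=$2
line=$(MegaCli -PDRbld -ShowProg -PhysDrv [ $encl:$slot ] -a0 | grep Completed)
pct=$(echo "$line" | sed 's/.*Completed \([0-9]*\)%.*/\1/')
min=$(echo "$line" | sed 's/.* in \([0-9]*\) Minutes.*/\1/')
case "$pct" in ''|0) echo "no rebuild progress reported"; exit 1;; esac
# remaining minutes = elapsed minutes * (100 - percent done) / percent done
left=$(( min * (100 - pct) / pct ))
printf 'time_left: PhysDrv [ %s:%s ] will be done rebuilding in about %d:%02d:00\n' \
    "$encl" "$slot" $(( left / 60 )) $(( left % 60 ))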

After the disk is finished rebuilding, reboot s3 and md1k-2 will be available once again.

Log entries containing Error f0

Extracted from the PERC 5/E event log entries:

Tue Mar 9 15:02:11 2010 Error on PD 1d(e1/s4) (Error f0)
Tue Mar 9 15:02:58 2010 Error on PD 1c(e1/s5) (Error f0)
Fri Mar 12 15:00:25 2010 Error on PD 30(e2/s1) (Error f0)
Fri Mar 12 15:01:12 2010 Error on PD 2f(e2/s2) (Error f0)
Wed Mar 17 03:55:03 2010 Error on PD 30(e2/s1) (Error f0)
Wed Mar 17 03:55:50 2010 Error on PD 2f(e2/s2) (Error f0)
Sat Mar 20 11:53:06 2010 Error on PD 29(e2/s8) (Error f0)
Sat Mar 20 11:53:54 2010 Error on PD 28(e2/s9) (Error f0)
Tue Mar 23 19:55:39 2010 Error on PD 28(e2/s9) (Error f0)
Tue Mar 23 19:56:21 2010 Error on PD 27(e2/s10) (Error f0)
Tue Mar 23 23:09:56 2010 Error on PD 31(e2/s0) (Error f0)
Tue Mar 23 23:10:43 2010 Error on PD 30(e2/s1) (Error f0)
Wed Mar 24 22:02:34 2010 Error on PD 29(e2/s8) (Error f0)
Wed Mar 24 22:03:17 2010 Error on PD 28(e2/s9) (Error f0)
Thu Mar 25 22:08:01 2010 Error on PD 31(e2/s0) (Error f0)
Thu Mar 25 22:08:44 2010 Error on PD 23(e2/s14) (Error f0)
Thu Mar 25 22:08:44 2010 Error on PD 30(e2/s1) (Error f0)
Fri Mar 26 20:02:13 2010 Error on PD 31(e2/s0) (Error f0)
Fri Mar 26 20:04:06 2010 Error on PD 23(e2/s14) (Error f0)
Fri Mar 26 20:04:06 2010 Error on PD 30(e2/s1) (Error f0)
Mon Mar 29 17:00:30 2010 Error on PD 31(e2/s0) (Error f0)
Mon Mar 29 17:01:13 2010 Error on PD 23(e2/s14) (Error f0)
Mon Mar 29 17:01:13 2010 Error on PD 30(e2/s1) (Error f0)
Tue Mar 30 00:00:45 2010 Error on PD 2e(e2/s3) (Error f0)
Tue Mar 30 00:01:56 2010 Error on PD 2d(e2/s4) (Error f0)
Tue Mar 30 00:01:56 2010 Error on PD 30(e2/s1) (Error f0)

Copying data from md1k-2 to md1k-1

The zpool md1k-2 contains the zfs filesystem md1k-2/archive, which is mounted at s3:/zfs/md1k-2/archive. That filesystem holds archives of 454 runs from May 2009 through December 2009 and all of the Illumina runs from 2009.

We're copying the data from md1k-2 to md1k-1. There are about 7TB of data on md1k-2. Only 729GB has been copied to md1k-1 so far.
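
To see how far along the copy is, compare the space used on the two filesystems (a sketch, run on s3):

[root@s3: ~/storage]# zfs list -o name,used,avail md1k-1/archive md1k-2/archive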

First, I make a few changes to facilitate the data copy:

[root@s3: ~/storage]# zfs unshare md1k-2/data
[root@s3: ~/storage]# zfs unshare md1k-2/archive
[root@s3: ~/storage]# zfs unshare md1k-1/archive
[root@s3: ~/storage]# zpool set failmode=continue md1k-2
[root@s3: ~/storage]# zfs set checksum=off md1k-2/archive

Then, I use rsync within a screen to copy the data:

[root@s3: ~/storage]# screen
[root@s3: ~/storage]# cd /zfs/md1k-2/archive
[root@s3: ~/storage]# rsync -Pav sequencing /zfs/md1k-1/archive

After the rsync starts copying the data, I detach the screen session by typing control-a followed by control-d.

To reconnect to the screen session:

[root@s3: ~/storage]# screen -r


When copying is finished, undo the previous changes:

[root@s3: ~/storage]# zfs set checksum=on md1k-2/archive
[root@s3: ~/storage]# zpool set failmode=wait md1k-2

After all of the data has been copied, you'll need to delete and re-create the md1k-2 zpool.
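
A sketch of what re-creating the pool involves; the device name, filesystem layout, and share settings below are placeholders and need to match the original setup:

zpool destroy md1k-2
zpool create md1k-2 <device>        # placeholder: the PERC 5/E virtual disk device
zfs create md1k-2/archive
zfs create md1k-2/data
zfs set sharenfs=on md1k-2/archive  # match the share options used before
zfs set sharenfs=on md1k-2/data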