SLab:Todo

CRITICAL DISK STORAGE PROBLEM WITH MD1K-2

There is a serious problem with md1k-2, one of the PowerVault MD1000's connected to s3. It will get into a state where two disks appear to have failed. A two-disk failure when using RAID-5 would mean complete data loss. Fortunately, I've found a remedy that is allowing us to copy the data to a different array.

Dell Higher Education Support
1-800-274-7799
Enter Express Service Code 43288304365 when prompted on the call.
host     service tag   contract end   description         notes
c8       CPM1NF1       02/15/2011     PowerEdge 1950      old schuster storage, moved PERC 5/E to s3
s3       GHNCVH1       06/22/2012     PowerEdge 1950      connected to md1k-2 via PERC 5/E
md1k-1   FVWQLF1       02/07/2011     PowerVault MD1000   enclosure 3
md1k-2   JVWQLF1       02/07/2011     PowerVault MD1000   enclosure 2
md1k-3   4X9NLF1       03/06/2011     PowerVault MD1000   enclosure 1

The following things need to be done:

  1. Copy the data from md1k-2 to md1k-1 (see Copying data below).
  2. Call Dell to resolve this problem.
  3. Adjust links in /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/archive to reflect the new data location (see the sketch after this list).
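
For step 3, a minimal sketch of re-pointing the archive symlinks, assuming they currently resolve to paths under /zfs/md1k-2/archive (check with readlink first; the real link layout may differ):

  cd /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/archive
  for link in *; do
      target=$(readlink "$link") || continue    # skip anything that isn't a symlink
      case "$target" in
          /zfs/md1k-2/archive/*)
              # rewrite the link in place, swapping md1k-2 for md1k-1 in the target
              ln -sfn "$(echo "$target" | sed 's|/zfs/md1k-2/|/zfs/md1k-1/|')" "$link"
              ;;
      esac
  done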

It seems unlikely that so many disks would fail so close together. I'm thinking it might be a problem with the PowerVault MD1000 itself.

Description of problem we're having with md1k-2

I've narrowed down the error by looking through the adapter event logs. There will be two errors, the second following about 45 seconds after the first:

Tue Mar 23 23:09:56 2010   Error on PD 31(e2/s0) (Error f0)
Tue Mar 23 23:10:43 2010   Error on PD 30(e2/s1) (Error f0)

After that, the virtual disk goes offline.

[root@s3: ~/storage]# MegaCli -LDInfo L1 -a0

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 1 (Target Id: 1)
Name:md1k-2
RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
Size:8.862 TB
State: Offline
Stripe Size: 64 KB
Number Of Drives:14
Span Depth:1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None
Number of Dedicated Hot Spares: 1
    0 : EnclId - 18 SlotId - 1 

If you go into the server room, you'll see flashing amber lights on the disks in md1k-2 slots 0 and 1. I can bring md1k-2 back using the following procedure (replace slots 0 and 1 with whichever slots are appropriate):

  1. Take the disks in md1k-2 slots 0 and 1 about half-way out and then push them back in.
  2. Wait a few seconds for the lights on md1k-2 slots 0 and 1 to return to green.
  3. Press the power button on s3 until it turns off.
  4. Press the power button on s3 again to turn it back on.
  5. Log into s3 as root after it has finished booting.

I then import the foreign configuration. The disk in md1k-2 slot 1 is fine, but the disk in slot 0 needs to be rebuilt.

[root@s3: ~/storage]# MegaCli -CfgForeign -Scan -a0
                                     
There are 2 foreign configuration(s) on controller 0.

Exit Code: 0x00
[root@s3: ~/storage]# MegaCli -CfgForeign -Import -a0
                                     
Foreign configuration is imported on controller 0.

Exit Code: 0x00
[root@s3: ~/storage]# ./check_disk_states 
PhysDrv [ 18:0 ] in md1k-2 is in Rebuild state.

command to check rebuild progress:
MegaCli -PDRbld -ShowProg -PhysDrv [ 18:0 ] -a0

command to estimate remaining rebuild time:
./time_left 18 0
[root@s3: ~/storage]# MegaCli -PDRbld -ShowProg -PhysDrv [ 18:0 ] -a0
                                     
Rebuild Progress on Device at Enclosure 18, Slot 0 Completed 56% in 139 Minutes.

Exit Code: 0x00
[root@s3: ~/storage]# ./time_left 18 0
time_left: PhysDrv [ 18:0 ] will be done rebuilding in about 1:49:12
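
check_disk_states and time_left are small local helper scripts kept in ~/storage and aren't reproduced here. A minimal sketch of what a time_left-style estimate could look like, assuming it simply rescales the "Completed N% in M Minutes" figure reported by MegaCli (the real script may work differently):

  #!/bin/sh
  # Hypothetical time_left sketch: scale the elapsed rebuild time by the work remaining.
  # Usage: ./time_left <enclosure> <slot>
  encl=$1 slot=$2
  MegaCli -PDRbld -ShowProg -PhysDrv [ $encl:$slot ] -a0 |
  awk -v e="$encl" -v s="$slot" '/Completed/ {
      for (i = 1; i <= NF; i++) {
          if ($i ~ /%$/)       pct  = $i + 0        # e.g. "56%" -> 56
          if ($i ~ /^Minutes/) mins = $(i - 1) + 0  # e.g. "139"
      }
      if (pct > 0) {
          secs = int(mins * (100 - pct) / pct * 60)  # time left, assuming a constant rebuild rate
          printf("time_left: PhysDrv [ %s:%s ] will be done rebuilding in about %d:%02d:%02d\n",
                 e, s, secs / 3600, (secs % 3600) / 60, secs % 60)
      }
  }'

With the 56% in 139 minutes shown above, the remaining time works out to about 109 minutes, which matches the 1:49:12 estimate above.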

After the disk is finished rebuilding, reboot s3 and md1k-2 will be available once again.

Log entries containing Error f0

The following entries were extracted from the PERC 5/E event log:

Tue Mar 9 15:02:11 2010 Error on PD 1d(e1/s4) (Error f0)
Tue Mar 9 15:02:58 2010 Error on PD 1c(e1/s5) (Error f0)
Fri Mar 12 15:00:25 2010 Error on PD 30(e2/s1) (Error f0)
Fri Mar 12 15:01:12 2010 Error on PD 2f(e2/s2) (Error f0)
Wed Mar 17 03:55:03 2010 Error on PD 30(e2/s1) (Error f0)
Wed Mar 17 03:55:50 2010 Error on PD 2f(e2/s2) (Error f0)
Sat Mar 20 11:53:06 2010 Error on PD 29(e2/s8) (Error f0)
Sat Mar 20 11:53:54 2010 Error on PD 28(e2/s9) (Error f0)
Tue Mar 23 19:55:39 2010 Error on PD 28(e2/s9) (Error f0)
Tue Mar 23 19:56:21 2010 Error on PD 27(e2/s10) (Error f0)
Tue Mar 23 23:09:56 2010 Error on PD 31(e2/s0) (Error f0)
Tue Mar 23 23:10:43 2010 Error on PD 30(e2/s1) (Error f0)
Wed Mar 24 22:02:34 2010 Error on PD 29(e2/s8) (Error f0)
Wed Mar 24 22:03:17 2010 Error on PD 28(e2/s9) (Error f0)
Thu Mar 25 22:08:01 2010 Error on PD 31(e2/s0) (Error f0)
Thu Mar 25 22:08:44 2010 Error on PD 23(e2/s14) (Error f0)
Thu Mar 25 22:08:44 2010 Error on PD 30(e2/s1) (Error f0)
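
A rough sketch of how entries like these can be pulled out of the controller, assuming the usual MegaCli -AdpEventLog options and "Time:" / "Event Description:" record fields (the exact field names can differ by MegaCli version):

  # Dump the controller event log to a file, then pair each event's timestamp
  # with its description and keep only the f0 errors.
  MegaCli -AdpEventLog -GetEvents -f perc_events.log -a0
  awk -F': ' '/^Time:/ { t = $2 }
              /^Event Description:.*Error f0/ { print t, $2 }' perc_events.log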

Copying data from md1k-2 to md1k-1

The zpool md1k-2 contains the zfs filesystem md1k-2/archive, which is mounted at s3:/zfs/md1k-2/archive. It holds archives of the 454 runs from May 2009 through December 2009 and all of the Illumina runs from 2009.

We're copying the data from md1k-2 to md1k-1. There are about 7TB of data on md1k-2. Only 250GB has been copied to md1k-1 so far. Once md1k-2 is back online (should be about 1:30pm Friday March 26), I'll continue copying the data to md1k-1.
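
Before restarting the copy, the pool, filesystems, and free space on the destination can be double-checked with the usual commands (a sketch):

  [root@s3: ~/storage]# zpool status md1k-2           # pool health after the rebuild
  [root@s3: ~/storage]# zfs list -r md1k-1 md1k-2     # filesystems and space used
  [root@s3: ~/storage]# df -h /zfs/md1k-1/archive     # free space on the destination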

First, I make a few changes to facilitate the data copy:

[root@s3: ~/storage]# zfs unshare md1k-2/data
[root@s3: ~/storage]# zfs unshare md1k-2/archive
[root@s3: ~/storage]# zfs unshare md1k-1/archive
[root@s3: ~/storage]# zpool set failmode=continue md1k-2
[root@s3: ~/storage]# zfs set checksum=off md1k-2/archive

Then, I use rsync within a screen to copy the data:

[root@s3: ~/storage]# screen
[root@s3: ~/storage]# cd /zfs/md1k-2/archive
[root@s3: ~/storage]# rsync -Pav sequencing /zfs/md1k-1/archive

After the rsync starts copying the data, I press control-a followed by control-d to detach the screen session.

To reconnect to the screen session:

[root@s3: ~/storage]# screen -r
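
Once the rsync finishes, something like the following can be used to confirm nothing was missed and to put the sharing and checksum settings back (a sketch; whether md1k-2 should be shared again at all depends on what Dell says about the enclosure):

  [root@s3: ~/storage]# rsync -Pavn /zfs/md1k-2/archive/sequencing /zfs/md1k-1/archive    # dry run; should show nothing left to copy
  [root@s3: ~/storage]# zfs set checksum=on md1k-2/archive
  [root@s3: ~/storage]# zfs share md1k-1/archive
  [root@s3: ~/storage]# zfs share md1k-2/archive
  [root@s3: ~/storage]# zfs share md1k-2/data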

Miscellaneous

The following is a list of things that need to be done (in no particular order).

  • Migrate linne to bx network.
  • Migrate Schuster Lab machines to bx network.
    • Or at least, install AFS client.
  • Automate the archiving of sequencing run directories.
    • Maybe after two weeks in staging they're moved into the archive?
  • Test suspending user jobs in Sun Grid Engine (see the sketch after this list).
    • Use Xen or something so user jobs can be stopped and not just suspended.
    • User jobs could be suspended to allow signal processing jobs to run immediately.
  • Combine linne and persephone clusters.
  • Come up with some backup solution.
  • Replace BioTeam iNquiry
    • Use Galaxy instead?
  • Implement a centralized database of sequencing run information.
    • Basically a small LIMS.
    • Maybe use Galaxy?
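
For the Sun Grid Engine item above, the basic suspend/resume mechanics to test are the qmod commands; a sketch (the job ID, queue name, and script name below are made up):

  qsub -N longjob run_assembly.sh     # submit an ordinary user job (hypothetical script)
  qstat                               # note its job ID, e.g. 12345
  qmod -sj 12345                      # suspend the job (SIGSTOP by default)
  qmod -usj 12345                     # unsuspend it
  qmod -sq persephone.q@node01        # or suspend every job in a queue instance
  qmod -usq persephone.q@node01       # and resume them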