Difference between revisions of "SLab:Todo"

Revision as of 18:12, 18 May 2010

slab:md1k-2 disk problems

Miscellaneous

The following list is in no particular order.

Migrate linne to bx network.
Migrate Schuster Lab machines to bx network.
- Or at least, install AFS client.
Automate the archiving of sequencing run directories.
- Maybe after two weeks in staging they're moved into the archive?
Test suspending user jobs in Sun Grid Engine.
- Use Xen or something so user jobs can be stopped and not just suspended.
- User jobs could be suspended to allow signal processing jobs to run immediately.
Combine linne and persephone clusters.
Come up with some backup solution.
Replace BioTeam iNquiry.
- Use Galaxy instead?
Implement a centralized database of sequencing run information.
- Basically a small LIMS.
- Maybe use Galaxy?
After problems with md1k-2 are fixed, turn on automated scrubbing.
Look into different storage systems.
Clean up old files in /afs/bx.psu.edu/depot/data/schuster_lab/old_stuff_to_cleanup.

@@ Line 1: / Line 1: @@
-== CRITICAL DISK STORAGE PROBLEM WITH MD1K-2 ==
+* [[slab:md1k-2 disk problems]]
-There is a serious problem with md1k-2, one of the PowerVault MD1000's connected to s3.  It will get into a state where two disks appear to have failed.  A two-disk failure when using RAID-5 would mean complete data loss.  Fortunately, I've found a remedy that is allowing us to copy the data to a different array.
-<pre>
-Dell Higher Education Support
--800-274-7799
-Enter Express Service Code 43288304365 when prompted on the call.
-</pre>
-{| border="1"
-|host
-|service tag
-|contract end
-|description
-|notes
-|-
-|c8
-|CPM1NF1
-|02/15/2011
-|PowerEdge 1950
-|old schuster storage, moved PERC 5/E to s3
-|-
-|s3
-|GHNCVH1
-|06/22/2012
-|PowerEdge 1950
-|connected to md1k-2 via PERC 5/E
-|-
-|md1k-1
-|FVWQLF1
-|02/07/2011
-|PowerVault MD1000
-|enclosure 3
-|-
-|md1k-2
-|JVWQLF1
-|02/07/2011
-|PowerVault MD1000
-|enclosure 2
-|-
-|md1k-3
-|4X9NLF1
-|03/06/2011
-|PowerVault MD1000
-|enclosure 1
-|}
-The following things need to be done:
-# Copy the data from md1k-2 to md1k-1 (see [[#Copying_data_from_md1k-2_to_md1k-1|Copying data]] below).
-# Call Dell to resolve this problem.
-# Adjust links in /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/archive to reflect new data location.
-It seems unlikely that so many disks would fail so close together.  I'm thinking it might be a problem with the PowerVault MD1000 itself.
-=== Description of problem we're having with md1k-2 ===
-I've narrowed down the error by looking through the adapter event logs.  Two errors appear, the second about 45 seconds after the first:
-<pre>
-Tue Mar 23 23:09:56 2010   Error on PD 31(e2/s0) (Error f0)
-Tue Mar 23 23:10:43 2010   Error on PD 30(e2/s1) (Error f0)
-</pre>
-After that, the virtual disk goes offline.
-<pre>
-[root@s3: ~/storage]# MegaCli -LDInfo L1 -a0
-Adapter 0 -- Virtual Drive Information:
-Virtual Disk: 1 (Target Id: 1)
-Name:md1k-2
-RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
-Size:8.862 TB
-State: Offline
-Stripe Size: 64 KB
-Number Of Drives:14
-Span Depth:1
-Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
-Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
-Access Policy: Read/Write
-Disk Cache Policy: Disk's Default
-Encryption Type: None
-Number of Dedicated Hot Spares: 1
-: EnclId - 18 SlotId - 1
-</pre>
-If you go into the server room, you'll see flashing amber lights on the disks in md1k-2 slots 0 and 1.  I can bring md1k-2 back using the following procedure (replace slots 0 and 1 with whichever slots are appropriate):
-# Take the disks in md1k-2 slots 0 and 1 about half-way out and then push them back in.
-# Wait a few seconds for the lights on md1k-2 slots 0 and 1 to return to green.
-# Press the power button on s3 until it turns off.
-# Press the power button on s3 again to turn it back on.
-# Log into s3 as root after it has finished booting.
-I then import the foreign configuration.  The disk in md1k-2 slot 1 is fine, but the disk in slot 0 needs to be rebuilt.  It takes about 4 hours to rebuild a drive.
-<pre>
-[root@s3: ~/storage]# MegaCli -CfgForeign -Scan -a0
-There are 2 foreign configuration(s) on controller 0.
-Exit Code: 0x00
-[root@s3: ~/storage]# MegaCli -CfgForeign -Import -a0
-Foreign configuration is imported on controller 0.
-Exit Code: 0x00
-[root@s3: ~/storage]# ./check_disk_states
-PhysDrv [ 18:0 ] in md1k-2 is in Rebuild state.
-command to check rebuild progress:
-MegaCli -PDRbld -ShowProg -PhysDrv [ 18:0 ] -a0
-command to estimate remaining rebuild time:
-./time_left 18 0
-[root@s3: ~/storage]# MegaCli -PDRbld -ShowProg -PhysDrv [ 18:0 ] -a0
-Rebuild Progress on Device at Enclosure 18, Slot 0 Completed 56% in 139 Minutes.
-Exit Code: 0x00
-[root@s3: ~/storage]# ./time_left 18 0
-time_left: PhysDrv [ 18:0 ] will be done rebuilding in about 1:49:12
-</pre>
-After the disk is finished rebuilding, reboot s3 and md1k-2 will be available once again.
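The <code>./time_left</code> script itself is not shown on this page. As a rough sketch only, the remaining time can be estimated by linear extrapolation from the MegaCli progress line. The function name <code>estimate_rebuild_left</code> is hypothetical, and the sketch assumes the progress line always has the "Completed N% in M Minutes" form shown above.

```shell
#!/bin/sh
# Hypothetical helper (not the real ./time_left): estimate the remaining
# rebuild time by linear extrapolation from MegaCli's progress line.
# Reads "MegaCli -PDRbld -ShowProg ..." output on stdin, prints H:MM:SS.
# Assumes a line of the form "... Completed 56% in 139 Minutes."
estimate_rebuild_left() {
    awk '
        /Completed [0-9]+% in [0-9]+ Minutes/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "Completed") pct  = $(i + 1) + 0   # "56%" -> 56
                if ($i == "in")        mins = $(i + 1) + 0   # elapsed minutes
            }
            # No progress yet: nothing to extrapolate from.
            if (pct <= 0) { print "unknown"; exit }
            # remaining = elapsed * (percent left / percent done)
            secs = int(mins * 60 * (100 - pct) / pct)
            printf "%d:%02d:%02d\n", int(secs / 3600), int((secs % 3600) / 60), secs % 60
            exit
        }
    '
}
```

For example, <code>MegaCli -PDRbld -ShowProg -PhysDrv [ 18:0 ] -a0 | estimate_rebuild_left</code> with the 56% in 139 minutes shown above gives 1:49:12. Rebuild speed is not necessarily constant, so treat the result as a rough guide.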
-=== Log entries containing Error f0 ===
-Extracted from the PERC 5/E event log entries:
-{| border="1"
-|Tue Mar  9 15:02:11 2010
-|Error on PD 1d(e1/s4) (Error f0)
-|-
-|Tue Mar  9 15:02:58 2010
-|Error on PD 1c(e1/s5) (Error f0)
-|-
-|Fri Mar 12 15:00:25 2010
-|Error on PD 30(e2/s1) (Error f0)
-|-
-|Fri Mar 12 15:01:12 2010
-|Error on PD 2f(e2/s2) (Error f0)
-|-
-|Wed Mar 17 03:55:03 2010
-|Error on PD 30(e2/s1) (Error f0)
-|-
-|Wed Mar 17 03:55:50 2010
-|Error on PD 2f(e2/s2) (Error f0)
-|-
-|Sat Mar 20 11:53:06 2010
-|Error on PD 29(e2/s8) (Error f0)
-|-
-|Sat Mar 20 11:53:54 2010
-|Error on PD 28(e2/s9) (Error f0)
-|-
-|Tue Mar 23 19:55:39 2010
-|Error on PD 28(e2/s9) (Error f0)
-|-
-|Tue Mar 23 19:56:21 2010
-|Error on PD 27(e2/s10) (Error f0)
-|-
-|Tue Mar 23 23:09:56 2010
-|Error on PD 31(e2/s0) (Error f0)
-|-
-|Tue Mar 23 23:10:43 2010
-|Error on PD 30(e2/s1) (Error f0)
-|-
-|Wed Mar 24 22:02:34 2010
-|Error on PD 29(e2/s8) (Error f0)
-|-
-|Wed Mar 24 22:03:17 2010
-|Error on PD 28(e2/s9) (Error f0)
-|-
-|Thu Mar 25 22:08:01 2010
-|Error on PD 31(e2/s0) (Error f0)
-|-
-|Thu Mar 25 22:08:44 2010
-|Error on PD 23(e2/s14) (Error f0)
-|-
-|Thu Mar 25 22:08:44 2010
-|Error on PD 30(e2/s1) (Error f0)
-|-
-|Fri Mar 26 20:02:13 2010
-|Error on PD 31(e2/s0) (Error f0)
-|-
-|Fri Mar 26 20:04:06 2010
-|Error on PD 23(e2/s14) (Error f0)
-|-
-|Fri Mar 26 20:04:06 2010
-|Error on PD 30(e2/s1) (Error f0)
-|-
-|Mon Mar 29 17:00:30 2010
-|Error on PD 31(e2/s0) (Error f0)
-|-
-|Mon Mar 29 17:01:13 2010
-|Error on PD 23(e2/s14) (Error f0)
-|-
-|Mon Mar 29 17:01:13 2010
-|Error on PD 30(e2/s1) (Error f0)
-|-
-|Tue Mar 30 00:00:45 2010
-|Error on PD 2e(e2/s3) (Error f0)
-|-
-|Tue Mar 30 00:01:56 2010
-|Error on PD 2d(e2/s4) (Error f0)
-|-
-|Tue Mar 30 00:01:56 2010
-|Error on PD 30(e2/s1) (Error f0)
-|-
-|}
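To see which slots show up most often, the entries above can be tallied per enclosure/slot. This is a minimal sketch using standard tools. The function name <code>count_f0_by_slot</code> is hypothetical, and it assumes log lines in the "Error on PD 30(e2/s1) (Error f0)" form listed above, fed on stdin.

```shell
#!/bin/sh
# Hypothetical helper: count "Error f0" events per enclosure/slot.
# Reads event-log lines on stdin and prints "count eN/sM", busiest first.
count_f0_by_slot() {
    # Keep only the eN/sM part of each matching line, then tally.
    sed -n 's/.*Error on PD [0-9a-f]*(\(e[0-9]*\/s[0-9]*\)) (Error f0).*/\1/p' |
        sort | uniq -c | sort -k1,1nr -k2,2
}
```

Run over the table above, this should put e2/s1 at the top. A tally like this only summarizes the log; it does not by itself say whether the disks or the enclosure are at fault.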
-=== Copying data from md1k-2 to md1k-1 ===
-zpool md1k-2 contains zfs filesystem md1k-2/archive which is mounted on s3:/zfs/md1k-2/archive.
-s3:/zfs/md1k-2/archive contains archives of 454 runs from May 2009 through December 2009 and all of the Illumina runs from 2009.
-We're copying the data from md1k-2 to md1k-1.  There is about 7 TB of data on md1k-2.  Only 729 GB has been copied to md1k-1 so far.
-First, I make a few changes to facilitate the data copy:
-<pre>
-[root@s3: ~/storage]# zfs unshare md1k-2/data
-[root@s3: ~/storage]# zfs unshare md1k-2/archive
-[root@s3: ~/storage]# zfs unshare md1k-1/archive
-[root@s3: ~/storage]# zpool set failmode=continue md1k-2
-[root@s3: ~/storage]# zfs set checksum=off md1k-2/archive
-</pre>
-Then, I use rsync within a screen to copy the data:
-<pre>
-[root@s3: ~/storage]# screen
-[root@s3: ~/storage]# cd /zfs/md1k-2/archive
-[root@s3: ~/storage]# rsync -Pav sequencing /zfs/md1k-1/archive
-</pre>
-After the rsync starts copying the data, I type <code>control-a control-d</code> to detach the screen session (that's control-a followed by a control-d).
-To reconnect to the screen session:
-<pre>
-[root@s3: ~/storage]# screen -r
-</pre>
-When copying is finished, undo the previous changes:
-<pre>
-[root@s3: ~/storage]# zfs set checksum=on md1k-2/archive
-[root@s3: ~/storage]# zpool set failmode=wait md1k-2
-</pre>
-After all of the data has been copied, you'll need to delete and re-create the md1k-2 zpool.
 == Miscellaneous ==