= in progress =

* [[slab:md1k-2 disk problems|md1k-2 disk problems]] - currently waiting for problems to show up again to get fresh RAID controller log entries, and to verify that switching to the spare EMM (array controller) didn't fix the problem
* nagios monitoring (http://kaylee.bx.psu.edu/nagios [login as guest/guest])
** who gets notified? everyone all at once, or use elapsed-time-based escalations?
** what do we want to monitor?
*** '''(DONE)''' samba share status on s2 so illumina-ga can copy data!
*** up/down state for all nodes and servers
*** '''(DONE)''' nfs server on s2/s3 and '''(PARTIAL)''' all the related tcp/udp ports necessary for proper nfs operation (see the port-check sketch after this list)
*** disk usage via snmp for s2+s3
*** '''(DONE)''' fault management via FMD over SNMP like we do for afs-fs{4..7}, thumper, saturn....
*** SGE queue status
*** sequencer up/down state, and maybe disk usage (can we do that for illumina-ga remotely somehow?)
* Migrate linne to bx network. See [[Slab:Linne_BX_migration]]
** install AFS client on all nodes
** '''(PARTIAL)''' finish sync'ing uid's/gid's to match what is in BX LDAP
** create BX accounts for those that don't already have them (cleanup/disable linne accounts that are no longer necessary for security reasons?)
** point all nodes to ldap.bx.psu.edu for authZ, and switch to the BX.PSU.EDU krb5 realm for authN
** disable the services running on linne that are no longer necessary
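For the NFS port checks above, the set of tcp/udp ports to watch can be enumerated from the portmapper on s2/s3 and then poked individually with the standard nagios plugins. A rough sketch, run from the nagios host (the plugin path is an assumption about where nagios-plugins live on kaylee):
<pre>
# List the RPC services (and their ports) that s2's portmapper advertises:
# nfs, mountd, nlockmgr, status, etc.  Repeat for s3.
rpcinfo -p s2

# Then check the individual ports found above, e.g.:
/usr/local/nagios/libexec/check_tcp -H s2 -p 2049   # nfsd
/usr/local/nagios/libexec/check_tcp -H s2 -p 111    # rpcbind/portmapper
</pre>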
  
= queued =

* attach the UPSes to s2 and s3 to enable graceful shutdown in the event of a power outage?
* automatic snapshots for the ZFS datasets on s2 and s3 (see http://blogs.sun.com/timf/resource/README.zfs-auto-snapshot.txt)
* more scripts:
** migrate sequencing runs from temp to staging (currently /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/software/archive/move_*_temp_to_staging)
*** perhaps notify by email automatically when there are finished runs ready to be moved?
*** notify by email when this is done so any interested parties will see that it has been done, and provide paths to new runs
*** this should call a script to update symlinks and release the data.schuster_lab volume
** script to better handle submitting illumina jobs to the cluster, with email notifications
** script to allow rsync'ing individual lanes from a run, given source directory, dest directory, and lane(s) (see the sketch after this list)
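A minimal sketch of the per-lane rsync helper mentioned above. The <code>s_LANE_*</code> naming pattern is an assumption about the run directory layout and would need to match the real pipeline output:
<pre>
#!/bin/bash
# rsync_lanes.sh -- copy only the requested lanes out of a run directory.
# Sketch only: assumes lane data is named s_LANE_* inside the run directory.
# usage: rsync_lanes.sh SRC_RUN_DIR DEST_DIR LANE [LANE ...]
set -eu
src=$1; dest=$2; shift 2
filters=()
for lane in "$@"; do
    filters+=(--include="s_${lane}_*")   # keep this lane's files/directories
done
filters+=(--exclude='s_*')               # skip every other lane
rsync -Pav "${filters[@]}" "$src/" "$dest/"
</pre>
For example, <code>./rsync_lanes.sh /zfs/md1k-2/archive/sequencing/RUN /zfs/md1k-1/archive/RUN 1 5</code> would copy only lanes 1 and 5 (RUN is a placeholder run directory).
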
== CRITICAL DISK STORAGE PROBLEM WITH MD1K-2 ==

There is a serious problem with md1k-2, one of the PowerVault MD1000's connected to s3.  It will get into a state where two disks appear to have failed.  A two-disk failure when using RAID-5 would mean complete data loss.  Fortunately, I've found a remedy that is allowing us to copy the data to a different array.

<pre>
Dell Higher Education Support
1-800-274-7799
Enter Express Service Code 43288304365 when prompted on the call.
</pre>

{| border="1"
|host
|service tag
|contract end
|description
|notes
|-
|c8
|CPM1NF1
|02/15/2011
|PowerEdge 1950
|old schuster storage, moved PERC 5/E to s3
|-
|s3
|GHNCVH1
|06/22/2012
|PowerEdge 1950
|connected to md1k-2 via PERC 5/E
|-
|md1k-1
|FVWQLF1
|02/07/2011
|PowerVault MD1000
|enclosure 3
|-
|md1k-2
|JVWQLF1
|02/07/2011
|PowerVault MD1000
|enclosure 2
|-
|md1k-3
|4X9NLF1
|03/06/2011
|PowerVault MD1000
|enclosure 1
|}
 
 
 
The following things need to be done:
# Copy the data from md1k-2 to md1k-1 (see [[#Copying_data_from_md1k-2_to_md1k-1|Copying data]] below).
# Call Dell to resolve this problem.
# Adjust links in /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/archive to reflect the new data location.
 
 
 
It seems unlikely that so many disks would fail so close together.  I'm thinking it might be a problem with the PowerVault MD1000 itself.
 
 
 
=== Description of problem we're having with md1k-2 ===
 
 
 
 
 
I've narrowed down the error by looking through the adapter event logs.  There will be two errors, the second following about 45 seconds after the first:
<pre>
Tue Mar 23 23:09:56 2010  Error on PD 31(e2/s0) (Error f0)
Tue Mar 23 23:10:43 2010  Error on PD 30(e2/s1) (Error f0)
</pre>
 
 
 
After that, the virtual disk goes offline.
 
 
 
<pre>
[root@s3: ~/storage]# MegaCli -LDInfo L1 -a0

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 1 (Target Id: 1)
Name:md1k-2
RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
Size:8.862 TB
State: Offline
Stripe Size: 64 KB
Number Of Drives:14
Span Depth:1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None
Number of Dedicated Hot Spares: 1
    0 : EnclId - 18 SlotId - 1
</pre>
 
 
 
If you go into the server room, you'll see flashing amber lights on the disks in md1k-2 slots 0 and 1.  I can bring md1k-2 back using the following procedure (replace slots 0 and 1 with whichever slots are appropriate):

# Take the disks in md1k-2 slots 0 and 1 about half-way out and then push them back in.
# Wait a few seconds for the lights on md1k-2 slots 0 and 1 to return to green.
# Press the power button on s3 until it turns off.
# Press the power button on s3 again to turn it back on.
# Log into s3 as root after it has finished booting.
 
 
 
I then import the foreign configuration.  The disk in md1k-2 slot 1 is fine, but the disk in slot 0 needs to be rebuilt.
 
<pre>
[root@s3: ~/storage]# MegaCli -CfgForeign -Scan -a0

There are 2 foreign configuration(s) on controller 0.

Exit Code: 0x00
[root@s3: ~/storage]# MegaCli -CfgForeign -Import -a0

Foreign configuration is imported on controller 0.

Exit Code: 0x00
[root@s3: ~/storage]# ./check_disk_states
PhysDrv [ 18:0 ] in md1k-2 is in Rebuild state.

command to check rebuild progress:
MegaCli -PDRbld -ShowProg -PhysDrv [ 18:0 ] -a0

command to estimate remaining rebuild time:
./time_left 18 0
[root@s3: ~/storage]# MegaCli -PDRbld -ShowProg -PhysDrv [ 18:0 ] -a0

Rebuild Progress on Device at Enclosure 18, Slot 0 Completed 56% in 139 Minutes.

Exit Code: 0x00
[root@s3: ~/storage]# ./time_left 18 0
time_left: PhysDrv [ 18:0 ] will be done rebuilding in about 1:49:12
</pre>
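The <code>./time_left</code> estimate is just a linear extrapolation of the percent-complete and elapsed-minutes figures that MegaCli reports. A sketch of how such a helper could work, assuming the "Completed N% in M Minutes." output format shown above (this is not the actual script on s3):
<pre>
#!/bin/bash
# time_left -- estimate remaining rebuild time for PhysDrv [ ENCL:SLOT ].
# Sketch: parses MegaCli's "Completed N% in M Minutes." line and extrapolates linearly.
encl=$1 slot=$2
line=$(MegaCli -PDRbld -ShowProg -PhysDrv [ $encl:$slot ] -a0 | grep Completed)
pct=$(echo "$line" | sed 's/.*Completed \([0-9]*\)%.*/\1/')
min=$(echo "$line" | sed 's/.*in \([0-9]*\) Minutes.*/\1/')
[ "$pct" -gt 0 ] || { echo "time_left: no rebuild progress reported yet" >&2; exit 1; }
# remaining seconds = elapsed * (100 - pct) / pct
left=$(( min * 60 * (100 - pct) / pct ))
printf 'time_left: PhysDrv [ %s:%s ] will be done rebuilding in about %d:%02d:%02d\n' \
    "$encl" "$slot" $(( left / 3600 )) $(( left % 3600 / 60 )) $(( left % 60 ))
</pre>
With the 56% / 139-minute example above, this works out to about 1:49:12, matching the output shown.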
 
 
 
After the disk is finished rebuilding, reboot s3 and md1k-2 will be available once again.
 
 
 
=== Log entries containing Error f0 ===
 
 
 
Extracted from the PERC 5/E event log:
 
 
{| border="1"
|Tue Mar  9 15:02:11 2010
|Error on PD 1d(e1/s4) (Error f0)
|-
|Tue Mar  9 15:02:58 2010
|Error on PD 1c(e1/s5) (Error f0)
|-
|Fri Mar 12 15:00:25 2010
|Error on PD 30(e2/s1) (Error f0)
|-
|Fri Mar 12 15:01:12 2010
|Error on PD 2f(e2/s2) (Error f0)
|-
|Wed Mar 17 03:55:03 2010
|Error on PD 30(e2/s1) (Error f0)
|-
|Wed Mar 17 03:55:50 2010
|Error on PD 2f(e2/s2) (Error f0)
|-
|Sat Mar 20 11:53:06 2010
|Error on PD 29(e2/s8) (Error f0)
|-
|Sat Mar 20 11:53:54 2010
|Error on PD 28(e2/s9) (Error f0)
|-
|Tue Mar 23 19:55:39 2010
|Error on PD 28(e2/s9) (Error f0)
|-
|Tue Mar 23 19:56:21 2010
|Error on PD 27(e2/s10) (Error f0)
|-
|Tue Mar 23 23:09:56 2010
|Error on PD 31(e2/s0) (Error f0)
|-
|Tue Mar 23 23:10:43 2010
|Error on PD 30(e2/s1) (Error f0)
|-
|Wed Mar 24 22:02:34 2010
|Error on PD 29(e2/s8) (Error f0)
|-
|Wed Mar 24 22:03:17 2010
|Error on PD 28(e2/s9) (Error f0)
|-
|Thu Mar 25 22:08:01 2010
|Error on PD 31(e2/s0) (Error f0)
|-
|Thu Mar 25 22:08:44 2010
|Error on PD 23(e2/s14) (Error f0)
|-
|Thu Mar 25 22:08:44 2010
|Error on PD 30(e2/s1) (Error f0)
|-
|Fri Mar 26 20:02:13 2010
|Error on PD 31(e2/s0) (Error f0)
|-
|Fri Mar 26 20:04:06 2010
|Error on PD 23(e2/s14) (Error f0)
|-
|Fri Mar 26 20:04:06 2010
|Error on PD 30(e2/s1) (Error f0)
|}
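For reference, entries like these can be pulled out of the controller's event log in one pass. A sketch (the <code>-AdpEventLog -GetEvents</code> syntax and the exact text to grep for are assumptions that depend on the MegaCli version and log format on s3):
<pre>
# Dump the PERC 5/E adapter event log to a file, then keep only the f0 errors.
MegaCli -AdpEventLog -GetEvents -f /tmp/perc_events.log -a0
grep 'Error f0' /tmp/perc_events.log
</pre>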
 
 
 
=== Copying data from md1k-2 to md1k-1 ===
 
 
 
zpool md1k-2 contains the zfs filesystem md1k-2/archive, which is mounted at s3:/zfs/md1k-2/archive.  s3:/zfs/md1k-2/archive contains archives of 454 runs from May 2009 through December 2009 and all of the Illumina runs from 2009.
 
 
 
We're copying the data from md1k-2 to md1k-1.  There are about 7TB of data on md1k-2.  Only 485GB has been copied to md1k-1 so far.  Once md1k-2 is back online (should be about 1:30pm Monday March 29), I'll continue copying the data to md1k-1.
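To see how far along the copy is, compare the space accounting on the source and destination datasets:
<pre>
# Space used/available on the source and destination datasets
zfs list -o name,used,avail md1k-2/archive md1k-1/archive
# Pool-level capacity summary
zpool list md1k-1 md1k-2
</pre>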
 
 
 
First, I make a few changes to facilitate the data copy:
 
<pre>
[root@s3: ~/storage]# zfs unshare md1k-2/data
[root@s3: ~/storage]# zfs unshare md1k-2/archive
[root@s3: ~/storage]# zfs unshare md1k-1/archive
[root@s3: ~/storage]# zpool set failmode=continue md1k-2
[root@s3: ~/storage]# zfs set checksum=off md1k-2/archive
</pre>
 
 
 
Then, I use rsync within a screen session to copy the data:
<pre>
[root@s3: ~/storage]# screen
[root@s3: ~/storage]# cd /zfs/md1k-2/archive
[root@s3: ~/storage]# rsync -Pav sequencing /zfs/md1k-1/archive
</pre>
 
 
 
After the rsync starts copying the data, I type <code>control-a control-d</code> to detach the screen session (that's control-a followed by a control-d).
 
 
 
To reconnect to the screen session:
<pre>
[root@s3: ~/storage]# screen -r
</pre>
 
 
 
 
 
When copying is finished, undo the previous changes:
 
<pre>
[root@s3: ~/storage]# zfs set checksum=on md1k-2/archive
[root@s3: ~/storage]# zpool set failmode=wait md1k-2
</pre>
 
 
 
After all of the data has been copied, you'll need to delete and re-create the md1k-2 zpool.
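A sketch of that re-creation; the device name is a placeholder, and the real device backing md1k-2 (the PERC 5/E virtual disk) should be confirmed with <code>zpool status md1k-2</code> before destroying anything:
<pre>
zpool status md1k-2        # note the device currently backing the pool
zpool destroy md1k-2
zpool create md1k-2 c2t1d0 # c2t1d0 is a placeholder -- substitute the real device
zfs create md1k-2/archive  # re-create the datasets that existed before
zfs create md1k-2/data
# re-enable NFS sharing if it was on before (exact share options not recorded here)
zfs set sharenfs=on md1k-2/archive
</pre>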
 
 
 
== Miscellaneous ==
 
 
 
The following list is in no particular order.
 
 
 
* Migrate Schuster Lab machines to bx network.
** Or at least, install AFS client and set up the BX.PSU.EDU krb5 realm to handle authentication so it's easier for the Schuster Lab machines to work with everyone else.
* Automate the archiving of sequencing run directories.
** Maybe after two weeks in staging they're moved into the archive?
** Need to keep symlinks up-to-date and release the data.schuster_lab volume appropriately!
* Test suspending user jobs in Sun Grid Engine (see the qmod example after this list).
** Use Xen or something so user jobs can be stopped and not just suspended.
** User jobs could be suspended to allow signal processing jobs to run immediately.
* Combine linne and persephone clusters.
** Dependent on finishing the linne-to-bx migration.
** Run master/slave SGE qmasters somewhere central and more reliable (currently c1.persephone and linne).
* Come up with some backup solution (tsm backups of s2 and s3?).
* Replace BioTeam iNquiry.
** Use Galaxy instead?
* Implement a centralized database of sequencing run information.
** Maybe generate this based on the filesystem layout and the presence/absence of certain files?
** Maybe use this for generating notifications so people know when certain parts of the pipeline are done?
** Basically a small LIMS.
** Maybe integrate with Galaxy.
* After problems with md1k-2 are fixed, turn on automated scrubbing.
* Look into different storage systems.
* Clean up old files in /afs/bx.psu.edu/depot/data/schuster_lab/old_stuff_to_cleanup.
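For the SGE suspension item above, the basic mechanics can be tried with qmod; by default this just SIGSTOPs the job's processes, which is why Xen-style stopping is listed as the stronger option:
<pre>
qmod -sj JOB_ID    # suspend a running job (JOB_ID is a placeholder)
qstat -s s         # list currently suspended jobs
qmod -usj JOB_ID   # resume it
</pre>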
