Difference between revisions of "SLab:Todo"

From CCGB
(CRITICAL)
 
(48 intermediate revisions by 2 users not shown)
Line 1: Line 1:
+ = in progress =
+ * [[slab:md1k-2 disk problems|md1k-2 disk problems]] - currently waiting for problems to show up again to get fresh RAID controller log entries, and to verify that switching to the spare EMM (array controller) didn't fix the problem
+ * nagios monitoring (http://kaylee.bx.psu.edu/nagios [login as guest/guest])
+ ** who gets notified? everyone all at once, or use elapsed-time-based escalations?
+ ** what do we want to monitor?
+ *** '''(DONE)''' samba share status on s2 so illumina-ga can copy data!
+ *** up/down state for all nodes and servers
+ *** '''(DONE)''' nfs server on s2/s3 and '''PARTIAL''' all the related tcp/udp ports necessary for proper nfs operation
+ *** disk usage via snmp for s2+s3
+ *** '''(DONE)''' fault management via FMD over SNMP like we do for afs-fs{4..7}, thumper, saturn....
+ *** SGE queue status
+ *** sequencer up/down state, and maybe disk usage (can we do that for illumina-ga remotely somehow?)
+ * Migrate linne to bx network. See [[Slab:Linne_BX_migration]]
+ ** install AFS client on all nodes
+ ** '''(PARTIAL)''' finish sync'ing uid's/gid's to match what is in BX LDAP
+ ** create BX accounts for those that don't already have them (cleanup/disable linne accounts that are no longer necessary for security reasons?)
+ ** point all nodes to ldap.bx.psu.edu for authZ, and switch to the BX.PSU.EDU krb5 realm for authN
+ ** disable the services running on linne that are no longer necessary
+ = queued =
+ * attach the UPSes to s2 and s3 to enable graceful shutdown in the event of a power outage?
+ * automatic snapshots for the ZFS datasets on s2 and s3 (see http://blogs.sun.com/timf/resource/README.zfs-auto-snapshot.txt)
+ * more scripts:
+ ** migrate sequencing runs from temp to staging (currently /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/software/archive/move_*_temp_to_staging)
+ *** perhaps notify by email automatically when there are finished runs ready to be moved?
+ *** notify by email when this is done so any interested parties will see that it has been done, and provide paths to new runs
+ *** this should call a script to update symlinks and release the data.schuster_lab volume
+ ** script to better handle submitting illumina jobs to cluster, with email notifications
+ ** script to allow rsync'ing individual lanes from a run, given source directory, dest directory, and lane(s)
+ * Migrate Schuster Lab machines to bx network.
+ ** Or at least, install AFS client and setup BX.PSU.EDU krb5 realm to handle authentication so it's easier for the schuster lab machines to work with everyone else
+ * Automate the archiving of sequencing run directories.
+ ** Maybe after two weeks in staging they're moved into the archive?
+ ** need to keep symlinks up-to-date and release the data.schuster_lab volume appropriately!
+ * Combine linne and persephone clusters
+ ** dependent on finishing linne-to-bx migration
+ ** master/slave SGE qmasters running somewhere central and more reliable (currently c1.persephone and linne)
+ * tsm backups of s2 and s3?
+ * Replace BioTeam iNquiry
+ ** Use Galaxy instead?
+ * Implement a centralized database of sequencing run information.
+ ** maybe generate this based on the filesystem layout and the presence/absence of certain files?
+ ** maybe use this for generating notifications so people know when certain parts of the pipeline are done?
+ ** Basically a small LIMS.
+ ** Maybe integrate with galaxy
+ * After problems with md1k-2 are fixed, turn on automated scrubbing.
+ * clean up old files in /afs/bx.psu.edu/depot/data/schuster_lab/old_stuff_to_cleanup

== CRITICAL DISK STORAGE PROBLEM WITH MD1K-2 ==

There is a serious problem with md1k-2, one of the PowerVault MD1000s connected to s3.  It will get into a state where two disks appear to have failed.  A two-disk failure when using RAID-5 would mean complete data loss.  Fortunately, I've found a remedy that is allowing us to copy the data to a different array.

<pre>
Dell Higher Education Support
1-800-274-7799
Enter Express Service Code 43288304365 when prompted on the call.
</pre>

{| border="1"
! host !! service tag !! contract end !! description !! notes
|-
| c8 || CPM1NF1 || 02/15/2011 || PowerEdge 1950 || old schuster storage, moved PERC 5/E to s3
|-
| s3 || GHNCVH1 || 06/22/2012 || PowerEdge 1950 || connected to md1k-2 via PERC 5/E
|-
| md1k-1 || FVWQLF1 || 02/07/2011 || PowerVault MD1000 || enclosure 3
|-
| md1k-2 || JVWQLF1 || 02/07/2011 || PowerVault MD1000 || enclosure 2
|-
| md1k-3 || 4X9NLF1 || 03/06/2011 || PowerVault MD1000 || enclosure 1
|}
 
 
 
 
 
I've narrowed down the error by looking through the adapter event logs.  There will be two errors, the second about 45 seconds after the first:
 
<pre>
Tue Mar 23 23:09:56 2010  Error on PD 31(e2/s0) (Error f0)
Tue Mar 23 23:10:43 2010  Error on PD 30(e2/s1) (Error f0)
</pre>
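The paired-error signature above can be spotted automatically. The following is a minimal Python sketch, not the tooling actually used on s3: it assumes the event log lines look exactly like the two samples above, reads `e2/s0` as enclosure 2, slot 0, and uses a hypothetical 120-second window. The function names are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches lines shaped like the two samples above (format assumed from them).
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4})\s+"
    r"Error on PD (?P<pd>\d+)\(e(?P<encl>\d+)/s(?P<slot>\d+)\)"
)

SAMPLE_LOG = [
    "Tue Mar 23 23:09:56 2010  Error on PD 31(e2/s0) (Error f0)",
    "Tue Mar 23 23:10:43 2010  Error on PD 30(e2/s1) (Error f0)",
]


def parse_events(lines):
    """Return (timestamp, enclosure, slot) tuples for each matching log line."""
    events = []
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
            events.append((ts, int(m.group("encl")), int(m.group("slot"))))
    return events


def find_double_failures(events, window=timedelta(seconds=120)):
    """Pair up errors on different slots of one enclosure within `window`."""
    events = sorted(events)
    pairs = []
    for i, first in enumerate(events):
        for second in events[i + 1:]:
            if second[0] - first[0] > window:
                break  # events are sorted, so later ones are even further away
            if first[1] == second[1] and first[2] != second[2]:
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    for first, second in find_double_failures(parse_events(SAMPLE_LOG)):
        gap = (second[0] - first[0]).total_seconds()
        print(f"enclosure {first[1]}: slots {first[2]} and {second[2]}, {gap:.0f}s apart")
```

A check like this could feed the nagios monitoring, so the double failure is reported as soon as the second error is logged.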
 
 
 
After that, the virtual disk goes offline.  
 
 
 
<pre>
s3% MegaCli -LDInfo L1 -a0

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 1 (Target Id: 1)
Name:md1k-2
RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
Size:8.862 TB
State: Offline
Stripe Size: 64 KB
Number Of Drives:14
Span Depth:1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None
Number of Dedicated Hot Spares: 1
    0 : EnclId - 18 SlotId - 1
</pre>
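To check the virtual disk state from a script rather than by eye, the output above can be split into key/value pairs. This is a rough Python sketch that assumes the output keeps the `Key: value` layout shown. It only parses text that has already been captured and does not invoke MegaCli itself. The function names and the shortened `SAMPLE` text are illustrative.

```python
SAMPLE = """\
Adapter 0 -- Virtual Drive Information:
Virtual Disk: 1 (Target Id: 1)
Name:md1k-2
RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
State: Offline
Number Of Drives:14
"""


def parse_ldinfo(text):
    """Turn 'Key: value' lines into a dict; lines without a value are skipped."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if sep and key and value:
            # Keep the first occurrence if a key repeats.
            info.setdefault(key, value)
    return info


def virtual_disk_is_offline(text):
    """True when the parsed 'State' field reads 'Offline' (case-insensitive)."""
    return parse_ldinfo(text).get("State", "").lower() == "offline"


if __name__ == "__main__":
    info = parse_ldinfo(SAMPLE)
    print(info["Name"], info["State"])
    print(virtual_disk_is_offline(SAMPLE))
```

Splitting on the first colon only keeps values such as `1 (Target Id: 1)` intact.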
 
 
 
If you go into the server room, you'll see flashing amber lights on the disks in md1k-2 slots 0 and 1.  I can bring md1k-2 back using the following procedure (replace slots 0 and 1 with whichever slots are appropriate):

# Take the disks in md1k-2 slots 0 and 1 about half-way out and then push them back in.
# Wait a few seconds for the lights on md1k-2 slots 0 and 1 to return to green.
# Press and hold the power button on s3 until it turns off.
# Press the power button on s3 again to turn it back on.
 
 
 
== Miscellaneous ==
 

Latest revision as of 13:47, 3 June 2010

in progress

  • md1k-2 disk problems - currently waiting for problems to show up again to get fresh RAID controller log entries, and to verify that switching to the spare EMM (array controller) didn't fix the problem
  • nagios monitoring (http://kaylee.bx.psu.edu/nagios [login as guest/guest])
    • who gets notified? everyone all at once, or use elapsed-time-based escalations?
    • what do we want to monitor?
      • (DONE) samba share status on s2 so illumina-ga can copy data!
      • up/down state for all nodes and servers
      • (DONE) nfs server on s2/s3 and PARTIAL all the related tcp/udp ports necessary for proper nfs operation
      • disk usage via snmp for s2+s3
      • (DONE) fault management via FMD over SNMP like we do for afs-fs{4..7}, thumper, saturn....
      • SGE queue status
      • sequencer up/down state, and maybe disk usage (can we do that for illumina-ga remotely somehow?)
  • Migrate linne to bx network. See Slab:Linne_BX_migration
    • install AFS client on all nodes
    • (PARTIAL) finish sync'ing uid's/gid's to match what is in BX LDAP
    • create BX accounts for those that don't already have them (cleanup/disable linne accounts that are no longer necessary for security reasons?)
    • point all nodes to ldap.bx.psu.edu for authZ, and switch to the BX.PSU.EDU krb5 realm for authN
    • disable the services running on linne that are no longer necessary

queued

  • attach the UPSes to s2 and s3 to enable graceful shutdown in the event of a power outage?
  • automatic snapshots for the ZFS datasets on s2 and s3 (see http://blogs.sun.com/timf/resource/README.zfs-auto-snapshot.txt)
  • more scripts:
    • migrate sequencing runs from temp to staging (currently /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/software/archive/move_*_temp_to_staging)
      • perhaps notify by email automatically when there are finished runs ready to be moved?
      • notify by email when this is done so any interested parties will see that it has been done, and provide paths to new runs
      • this should call a script to update symlinks and release the data.schuster_lab volume
    • script to better handle submitting illumina jobs to cluster, with email notifications
    • script to allow rsync'ing individual lanes from a run, given source directory, dest directory, and lane(s)
  • Migrate Schuster Lab machines to bx network.
    • Or at least, install AFS client and setup BX.PSU.EDU krb5 realm to handle authentication so it's easier for the schuster lab machines to work with everyone else
  • Automate the archiving of sequencing run directories.
    • Maybe after two weeks in staging they're moved into the archive?
    • need to keep symlinks up-to-date and release the data.schuster_lab volume appropriately!
  • Combine linne and persephone clusters
    • dependent on finishing linne-to-bx migration
    • master/slave SGE qmasters running somewhere central and more reliable (currently c1.persephone and linne)
  • tsm backups of s2 and s3?
  • Replace BioTeam iNquiry
    • Use Galaxy instead?
  • Implement a centralized database of sequencing run information.
    • maybe generate this based on the filesystem layout and the presence/absence of certain files?
    • maybe use this for generating notifications so people know when certain parts of the pipeline are done?
    • Basically a small LIMS.
    • Maybe integrate with galaxy
  • After problems with md1k-2 are fixed, turn on automated scrubbing.
  • clean up old files in /afs/bx.psu.edu/depot/data/schuster_lab/old_stuff_to_cleanup
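
The "after two weeks in staging" idea from the archiving item above could start from something like the following Python sketch. It is only an illustration under stated assumptions: each run is a directory directly under a staging directory, and the directory's modification time stands in for how long it has been in staging (that time changes whenever entries are added or removed, so a real script might prefer a marker file). The function name and the 14-day default are hypothetical, and the sketch only lists candidates. It does not move data, update symlinks, or release the data.schuster_lab volume.

```python
import os
import time


def runs_ready_for_archive(staging_dir, max_age_days=14, now=None):
    """List run directories under staging_dir not modified for max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    ready = []
    for entry in sorted(os.listdir(staging_dir)):
        path = os.path.join(staging_dir, entry)
        # Only directories count as runs; stray files are ignored.
        if os.path.isdir(path) and os.path.getmtime(path) <= cutoff:
            ready.append(path)
    return ready
```

The returned paths could then be handed to whatever script does the move and sends the email notification.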