Latest revision as of 13:47, 3 June 2010

in progress

md1k-2 disk problems - currently waiting for problems to show up again to get fresh RAID controller log entries, and to verify that switching to the spare EMM (array controller) didn't fix the problem
nagios monitoring (http://kaylee.bx.psu.edu/nagios [login as guest/guest])
- who gets notified? everyone all at once, or use elapsed-time-based escalations?
- what do we want to monitor?
  - (DONE) samba share status on s2 so illumina-ga can copy data!
  - up/down state for all nodes and servers
  - (DONE) nfs server on s2/s3 and PARTIAL all the related tcp/udp ports necessary for proper nfs operation
  - disk usage via snmp for s2+s3
  - (DONE) fault management via FMD over SNMP like we do for afs-fs{4..7}, thumper, saturn....
  - SGE queue status
  - sequencer up/down state, and maybe disk usage (can we do that for illumina-ga remotely somehow?)
Migrate linne to bx network. See Slab:Linne_BX_migration
- install AFS client on all nodes
- (PARTIAL) finish sync'ing uid's/gid's to match what is in BX LDAP
- create BX accounts for those that don't already have them (cleanup/disable linne accounts that are no longer necessary for security reasons?)
- point all nodes to ldap.bx.psu.edu for authZ, and switch to the BX.PSU.EDU krb5 realm for authN
- disable the services running on linne that are no longer necessary
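
For the PARTIAL item on NFS-related tcp ports, a Nagios-style check could be sketched roughly as below. This is only a sketch under assumptions: the function names `tcp_port_open` and `check_ports` are hypothetical, the port numbers 111 (rpcbind) and 2049 (nfs) are the conventional defaults rather than anything confirmed for s2/s3, and the ports used by mountd/lockd depend on local configuration, so they are not covered here. The exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import socket
from typing import Callable, Iterable, Tuple

# Conventional Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(
    host: str,
    ports: Iterable[int],
    probe: Callable[[str, int], bool] = tcp_port_open,
) -> Tuple[int, str]:
    """Probe each port and return (exit_code, status_line) for Nagios.

    `probe` is injectable so the logic can be exercised without a network.
    """
    closed = [port for port in ports if not probe(host, port)]
    if closed:
        listing = ", ".join(str(port) for port in closed)
        return CRITICAL, f"CRITICAL: {host} closed TCP ports: {listing}"
    return OK, f"OK: all TCP ports open on {host}"
```

A wrapper script would print the status line and exit with the returned code, e.g. `code, line = check_ports("s2", [111, 2049])`.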

queued

attach the UPSes to s2 and s3 to enable graceful shutdown in the event of a power outage?
automatic snapshots for the ZFS datasets on s2 and s3 (see http://blogs.sun.com/timf/resource/README.zfs-auto-snapshot.txt)
more scripts:
- migrate sequencing runs from temp to staging (currently /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/software/archive/move_*_temp_to_staging)
  - perhaps notify by email automatically when there are finished runs ready to be moved?
  - notify by email when this is done so any interested parties will see that it has been done, and provide paths to new runs
  - this should call a script to update symlinks and release the data.schuster_lab volume
- script to better handle submitting illumina jobs to cluster, with email notifications
- script to allow rsync'ing individual lanes from a run, given source directory, dest directory, and lane(s)
Migrate Schuster Lab machines to bx network.
- Or at least, install AFS client and set up BX.PSU.EDU krb5 realm to handle authentication so it's easier for the schuster lab machines to work with everyone else
Automate the archiving of sequencing run directories.
- Maybe after two weeks in staging they're moved into the archive?
- need to keep symlinks up-to-date and release the data.schuster_lab volume appropriately!
Combine linne and persephone clusters
- dependent on finishing linne-to-bx migration
- master/slave SGE qmasters running somewhere central and more reliable (currently c1.persephone and linne)
tsm backups of s2 and s3?
Replace BioTeam iNquiry
- Use Galaxy instead?
Implement a centralized database of sequencing run information.
- maybe generate this based on the filesystem layout and the presence/absence of certain files?
- maybe use this for generating notifications so people know when certain parts of the pipeline are done?
- Basically a small LIMS.
- Maybe integrate with galaxy
After problems with md1k-2 are fixed, turn on automated scrubbing.
clean up old files in /afs/bx.psu.edu/depot/data/schuster_lab/old_stuff_to_cleanup
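
The queued script for rsync'ing individual lanes (given source directory, dest directory, and lane(s)) might start from something like the following. It is a minimal sketch, not the lab's actual tooling: the helper name `build_rsync_command` is hypothetical, and the `s_<lane>_*` file-name pattern is an assumption about how lane files are named within a run directory, so the pattern would need adjusting to the real layout. It relies only on standard rsync options (`-a`, `--prune-empty-dirs`, `--include`, `--exclude`).

```python
from typing import Iterable, List


def build_rsync_command(source_dir: str, dest_dir: str, lanes: Iterable[int]) -> List[str]:
    """Build an rsync argv that copies only files belonging to the given lanes.

    Assumption: lane files are named s_<lane>_* somewhere under source_dir.
    """
    lane_list = sorted(set(int(lane) for lane in lanes))
    if not lane_list:
        raise ValueError("at least one lane is required")
    if lane_list[0] < 1:
        raise ValueError(f"lane numbers must be positive: {lane_list[0]}")

    # Descend into every directory, keep only matching lane files,
    # exclude everything else, and drop directories left empty.
    command = ["rsync", "-a", "--prune-empty-dirs", "--include=*/"]
    command += [f"--include=s_{lane}_*" for lane in lane_list]
    command += ["--exclude=*"]
    # Trailing slashes make rsync copy the directory contents, not the directory.
    command += [source_dir.rstrip("/") + "/", dest_dir.rstrip("/") + "/"]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)`; building the argv separately keeps the filter logic easy to inspect with rsync's `--dry-run` before any data is moved.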

@@ Line 1: / Line 1: @@
-== CRITICAL DISK STORAGE PROBLEM WITH MD1K-2 ==
+= in progress =
+* [[slab:md1k-2 disk problems|md1k-2 disk problems]] - currently waiting for problems to show up again to get fresh RAID controller log entries, and to verify that switching to the spare EMM (array controller) didn't fix the problem
+* nagios monitoring (http://kaylee.bx.psu.edu/nagios  [login as guest/guest])
+** who gets notified? everyone all at once, or use elapsed-time-based escalations?
+** what do we want to monitor?
+*** '''(DONE)''' samba share status on s2 so illumina-ga can copy data!
+*** up/down state for all nodes and servers
+*** '''(DONE)''' nfs server on s2/s3 and '''PARTIAL''' all the related tcp/udp ports necessary for proper nfs operation
+*** disk usage via snmp for s2+s3
+*** '''(DONE)''' fault management via FMD over SNMP like we do for afs-fs{4..7}, thumper, saturn....
+*** SGE queue status
+*** sequencer up/down state, and maybe disk usage (can we do that for illumina-ga remotely somehow?)
+* Migrate linne to bx network. See [[Slab:Linne_BX_migration]]
+** install AFS client on all nodes
+** '''(PARTIAL)''' finish sync'ing uid's/gid's to match what is in BX LDAP
+** create BX accounts for those that don't already have them (cleanup/disable linne accounts that are no longer necessary for security reasons?)
+** point all nodes to ldap.bx.psu.edu for authZ, and switch to the BX.PSU.EDU krb5 realm for authN
+** disable the services running on linne that are no longer necessary
-There is a serious problem with md1k-2, one of the PowerVault MD1000's connected to s3.  It will get into a state where two disks appear to have failed.  A two-disk failure when using RAID-5 would mean complete data loss.  Fortunately, I've found a remedy that is allowing us to copy the data to a different array.
+= queued =
+* attach the UPSes to s2 and s3 to enable graceful shutdown in the event of a power outage?
-<pre>
+* automatic snapshots for the ZFS datasets on s2 and s3 (see http://blogs.sun.com/timf/resource/README.zfs-auto-snapshot.txt)
-Dell Higher Education Support
+* more scripts:
--800-274-7799
+** migrate sequencing runs from temp to staging (currently /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/software/archive/move_*_temp_to_staging)
-Enter Express Service Code 43288304365 when prompted on the call.
+*** perhaps notify by email automatically when there are finished runs ready to be moved?
-</pre>
+*** notify by email when this is done so any interested parties will see that it has been done, and provide paths to new runs
+*** this should call a script to update symlinks and release the data.schuster_lab volume
-{| border="1"
+** script to better handle submitting illumina jobs to cluster, with email notifications
-|host
+** script to allow rsync'ing individual lanes from a run, given source directory, dest directory, and lane(s)
-|service tag
+* Migrate Schuster Lab machines to bx network.
-|contract end
+** Or at least, install AFS client and set up BX.PSU.EDU krb5 realm to handle authentication so it's easier for the schuster lab machines to work with everyone else
-|description
+* Automate the archiving of sequencing run directories.
-|notes
+** Maybe after two weeks in staging they're moved into the archive?
-|-
+** need to keep symlinks up-to-date and release the data.schuster_lab volume appropriately!
-|c8
+* Combine linne and persephone clusters
-|CPM1NF1
+** dependent on finishing linne-to-bx migration
-|02/15/2011
+** master/slave SGE qmasters running somewhere central and more reliable (currently c1.persephone and linne)
-|PowerEdge 1950
+* tsm backups of s2 and s3?
-|old schuster storage, moved PERC 5/E to s3
+* Replace BioTeam iNquiry
-|-
+** Use Galaxy instead?
-|s3
+* Implement a centralized database of sequencing run information.
-|GHNCVH1
+** maybe generate this based on the filesystem layout and the presence/absence of certain files?
-|06/22/2012
+** maybe use this for generating notifications so people know when certain parts of the pipeline are done?
-|PowerEdge 1950
+** Basically a small LIMS.
-|connected to md1k-2 via PERC 5/E
+** Maybe integrate with galaxy
-|-
+* After problems with md1k-2 are fixed, turn on automated scrubbing.
-|md1k-1
+* clean up old files in /afs/bx.psu.edu/depot/data/schuster_lab/old_stuff_to_cleanup
-|FVWQLF1
-|02/07/2011
-|PowerVault MD1000
-|enclosure 3
-|-
-|md1k-2
-|JVWQLF1
-|02/07/2011
-|PowerVault MD1000
-|enclosure 2
-|-
-|md1k-3
-|4X9NLF1
-|03/06/2011
-|PowerVault MD1000
-|enclosure 1
-|}
-I've narrowed down the error by looking through the adapter event logs.  There will be two errors, the second about 45 seconds after the first:
-<pre>
-Tue Mar 23 23:09:56 2010   Error on PD 31(e2/s0) (Error f0)
-Tue Mar 23 23:10:43 2010   Error on PD 30(e2/s1) (Error f0)
-</pre>
-After that, the virtual disk goes offline.
-<pre>
-s3% MegaCli -LDInfo L1 -a0
-Adapter 0 -- Virtual Drive Information:
-Virtual Disk: 1 (Target Id: 1)
-Name:md1k-2
-RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
-Size:8.862 TB
-State: Offline
-Stripe Size: 64 KB
-Number Of Drives:14
-Span Depth:1
-Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
-Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
-Access Policy: Read/Write
-Disk Cache Policy: Disk's Default
-Encryption Type: None
-Number of Dedicated Hot Spares: 1
-: EnclId - 18 SlotId - 1
-</pre>
-If you go into the server room, you'll see flashing amber lights on the disks in md1k-2 slots 0 and 1.  I can bring md1k-2 back using the following procedure (replace slots 0 and 1 with whichever slots are appropriate):
-# Take the disks in md1k-2 slots 0 and 1 about half-way out and then push them back in.
-# Wait a few seconds for the lights on md1k-2 slots 0 and 1 to return to green.
-# Press the power button on s3 until it turns off.
-# Press the power button on s3 again to turn it back on.
-== Miscellaneous ==

Difference between revisions of "SLab:Todo"
