Latest revision as of 13:47, 3 June 2010

in progress

md1k-2 disk problems - currently waiting for the problems to recur so we can collect fresh RAID controller log entries and verify whether switching to the spare EMM (array controller) fixed the problem
nagios monitoring (http://kaylee.bx.psu.edu/nagios [login as guest/guest])
- who gets notified? everyone all at once, or use elapsed-time-based escalations?
- what do we want to monitor?
  - (DONE) samba share status on s2 so illumina-ga can copy data!
  - up/down state for all nodes and servers
  - (DONE) nfs server on s2/s3 and (PARTIAL) all the related tcp/udp ports necessary for proper nfs operation
  - disk usage via snmp for s2+s3
  - (DONE) fault management via FMD over SNMP like we do for afs-fs{4..7}, thumper, saturn....
  - SGE queue status
  - sequencer up/down state, and maybe disk usage (can we do that for illumina-ga remotely somehow?)
Migrate linne to bx network. See Slab:Linne_BX_migration
- install AFS client on all nodes
- (PARTIAL) finish sync'ing uid's/gid's to match what is in BX LDAP
- create BX accounts for those that don't already have them (also clean up/disable linne accounts that are no longer needed, for security reasons?)
- point all nodes to ldap.bx.psu.edu for authZ, and switch to the BX.PSU.EDU krb5 realm for authN
- disable the services running on linne that are no longer necessary
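
For the disk-usage item in the monitoring list above, a minimal sketch of a Nagios-style check in Python. It follows the standard Nagios plugin convention of returning 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN) together with a one-line status message. Note that it inspects a locally mounted path rather than querying SNMP as the list proposes, and the function names and the 80%/90% thresholds are hypothetical placeholders, not anything already deployed.

```python
import shutil

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_usage(percent_used, warn=80.0, crit=90.0):
    """Map a percent-used figure to a Nagios state (thresholds are assumptions)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path, warn=80.0, crit=90.0):
    """Return (state, message) for the filesystem that contains `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Missing or unreadable path: report UNKNOWN instead of crashing.
        return UNKNOWN, f"DISK UNKNOWN - {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path}: reported size is zero"
    percent = 100.0 * usage.used / usage.total
    state = classify_usage(percent, warn, crit)
    return state, f"DISK {LABELS[state]} - {path} is {percent:.1f}% full"
```

A wrapper script would print the message and exit with the returned state, which is what Nagios reads to decide whether to notify anyone.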

queued

attach the UPSes to s2 and s3 to enable graceful shutdown in the event of a power outage?
automatic snapshots for the ZFS datasets on s2 and s3 (see http://blogs.sun.com/timf/resource/README.zfs-auto-snapshot.txt)
more scripts:
- migrate sequencing runs from temp to staging (currently /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/software/archive/move_*_temp_to_staging)
  - perhaps notify by email automatically when there are finished runs ready to be moved?
  - notify by email when this is done so any interested parties will see that it has been done, and provide paths to new runs
  - this should call a script to update symlinks and release the data.schuster_lab volume
- script to better handle submitting illumina jobs to cluster, with email notifications
- script to allow rsync'ing individual lanes from a run, given source directory, dest directory, and lane(s)
Migrate Schuster Lab machines to bx network.
- Or at least, install the AFS client and set up the BX.PSU.EDU krb5 realm to handle authentication so it's easier for the Schuster Lab machines to work with everyone else
Automate the archiving of sequencing run directories.
- Maybe after two weeks in staging they're moved into the archive?
- need to keep symlinks up-to-date and release the data.schuster_lab volume appropriately!
Combine linne and persephone clusters
- dependent on finishing linne-to-bx migration
- master/slave SGE qmasters running somewhere central and more reliable (currently c1.persephone and linne)
TSM backups of s2 and s3?
Replace BioTeam iNquiry
- Use Galaxy instead?
Implement a centralized database of sequencing run information.
- maybe generate this based on the filesystem layout and the presence/absence of certain files?
- maybe use this for generating notifications so people know when certain parts of the pipeline are done?
- Basically a small LIMS.
- Maybe integrate with Galaxy
After problems with md1k-2 are fixed, turn on automated scrubbing.
clean up old files in /afs/bx.psu.edu/depot/data/schuster_lab/old_stuff_to_cleanup
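
The "two weeks in staging" idea from the archiving item above could start from something like the following Python sketch, which only selects candidate run directories and does not move anything, update symlinks, or release the volume. It assumes that each run is a top-level directory under the staging directory and that the directory's modification time is a reasonable proxy for when the run arrived; both are assumptions, and the function name `runs_ready_for_archive` is hypothetical.

```python
import os
import time

SECONDS_PER_DAY = 24 * 60 * 60


def runs_ready_for_archive(staging_dir, max_age_days=14, now=None):
    """Return sorted paths of run directories last modified more than max_age_days ago."""
    if now is None:
        now = time.time()
    cutoff = now - max_age_days * SECONDS_PER_DAY
    ready = []
    with os.scandir(staging_dir) as entries:
        for entry in entries:
            # Only real top-level directories count as runs; symlinks are skipped.
            if not entry.is_dir(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                ready.append(entry.path)
    return sorted(ready)
```

A directory's modification time changes only when entries are added to or removed from it directly, so a run that is still being written deep inside subdirectories could look older than it is. A marker file written when the run finishes would be a safer signal if the pipeline can provide one.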

@@ Line 1: / Line 1: @@
-== CRITICAL ==
+= in progress =
+* [[slab:md1k-2 disk problems|md1k-2 disk problems]] - currently waiting for problems to show up again to get fresh RAID controller log entries, and to verify that switching to the spare EMM (array controller) didn't fix the problem
+* nagios monitoring (http://kaylee.bx.psu.edu/nagios  [login as guest/guest])
+** who gets notified? everyone all at once, or use elapsed-time-based escalations?
+** what do we want to monitor?
+*** '''(DONE)''' samba share status on s2 so illumina-ga can copy data!
+*** up/down state for all nodes and servers
+*** '''(DONE)''' nfs server on s2/s3 and '''PARTIAL''' all the related tcp/udp ports necessary for proper nfs operation
+*** disk usage via snmp for s2+s3
+*** '''(DONE)''' fault management via FMD over SNMP like we do for afs-fs{4..7}, thumper, saturn....
+*** SGE queue status
+*** sequencer up/down state, and maybe disk usage (can we do that for illumina-ga remotely somehow?)
+* Migrate linne to bx network. See [[Slab:Linne_BX_migration]]
+** install AFS client on all nodes
+** '''(PARTIAL)''' finish sync'ing uid's/gid's to match what is in BX LDAP
+** create BX accounts for those that don't already have them (cleanup/disable linne accounts that are no longer necessary for security reasons?)
+** point all nodes to ldap.bx.psu.edu for authZ, and switch to the BX.PSU.EDU krb5 realm for authN
+** disable the services running on linne that are no longer necessary
-There is a serious problem with md1k-2, one of the PowerVault MD1000's connected to s3.  It will get into a state where two disks appear to have failed.  A two-disk failure when using RAID-5 would mean complete data loss.  Fortunately, I've found a remedy that is allowing us to copy the data to a different array.
+= queued =
+* attach the UPSes to s2 and s3 to enable graceful shutdown in the event of a power outage?
-<pre>
+* automatic snapshots for the ZFS datasets on s2 and s3 (see http://blogs.sun.com/timf/resource/README.zfs-auto-snapshot.txt)
-Dell Higher Education Support
+* more scripts:
--800-274-7799
+** migrate sequencing runs from temp to staging (currently /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/software/archive/move_*_temp_to_staging)
-Enter Express Service Code 43288304365 when prompted on the call.
+*** perhaps notify by email automatically when there are finished runs ready to be moved?
-</pre>
+*** notify by email when this is done so any interested parties will see that it has been done, and provide paths to new runs
+*** this should call a script to update symlinks and release the data.schuster_lab volume
-{| border="1"
+** script to better handle submitting illumina jobs to cluster, with email notifications
-|host
+** script to allow rsync'ing individual lanes from a run, given source directory, dest directory, and lane(s)
-|service tag
+* Migrate Schuster Lab machines to bx network.
-|contract end
+** Or at least, install AFS client and setup BX.PSU.EDU krb5 realm to handle authentication so it's easier for the schuster lab machines to work with everyone else
-|description
+* Automate the archiving of sequencing run directories.
-|notes
+** Maybe after two weeks in staging they're moved into the archive?
-|-
+** need to keep symlinks up-to-date and release the data.schuster_lab volume appropriately!
-|c8
+* Combine linne and persephone clusters
-|CPM1NF1
+** dependent on finishing linne-to-bx migration
-|02/15/2011
+** master/slave SGE qmasters running somewhere central and more reliable (currently c1.persephone and linne)
-|PowerEdge 1950
+* tsm backups of s2 and s3?
-|old schuster storage, moved PERC 5/E to s3
+* Replace BioTeam iNquiry
-|-
+** Use Galaxy instead?
-|s3
+* Implement a centralized database of sequencing run information.
-|GHNCVH1
+** maybe generate this based on the filesystem layout and the presence/absence of certain files?
-|06/22/2012
+** maybe use this for generating notifications so people know when certain parts of the pipeline are done?
-|PowerEdge 1950
+** Basically a small LIMS.
-|connected to md1k-2 via PERC 5/E
+** Maybe integrate with galaxy
-|-
+* After problems with md1k-2 are fixed, turn on automated scrubbing.
-|md1k-1
+* clean up old files in /afs/bx.psu.edu/depot/data/schuster_lab/old_stuff_to_cleanup
-|FVWQLF1
-|02/07/2011
-|PowerVault MD1000
-|enclosure 3
-|-
-|md1k-2
-|JVWQLF1
-|02/07/2011
-|PowerVault MD1000
-|enclosure 2
-|-
-|md1k-3
-|4X9NLF1
-|03/06/2011
-|PowerVault MD1000
-|enclosure 1
-|}
-I've narrowed down the error by looking through the adapter event logs.  There will be two errors, the second about 45 seconds after the first.
-<pre>
-Tue Mar 23 23:09:56 2010   Error on PD 31(e2/s0) (Error f0)
-Tue Mar 23 23:10:43 2010   Error on PD 30(e2/s1) (Error f0)
-</pre>
-== Miscellaneous ==

Difference between revisions of "SLab:Todo"
