SLab:Todo

Revision as of 18:44, 18 May 2010

in progress

  • md1k-2 disk problems: currently waiting for the problems to recur so we can capture fresh RAID controller log entries and verify that switching to the spare EMM (array controller) didn't fix the problem

queued

  • nagios monitoring (http://kaylee.bx.psu.edu/nagios [login as guest/guest]); candidate checks are sketched after this item
    • who gets notified? everyone all at once, or use elapsed-time-based escalations?
    • what do we want to monitor?
      • up/down state for all nodes and servers
      • nfs server on s2/s3 and all the related tcp/udp ports necessary for proper nfs operation
      • disk usage via SNMP for s2+s3
      • fault management via FMD over SNMP like we do for afs-fs{4..7}, thumper, saturn....
      • SGE queue status
      • sequencer up/down state, and maybe disk usage (can we do that for illumina-ga remotely somehow?)
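
A rough sketch of how a few of these could map onto stock Nagios plugin invocations; the hostnames, thresholds, SNMP community, and disk OID below are placeholders, not decisions:

    # up/down state via ICMP
    check_ping -H s2 -w 100.0,20% -c 500.0,60%
    # NFS service ports on s2/s3 (2049 = nfsd, 111 = rpcbind; mountd/lockd/statd eventually too)
    check_tcp -H s2 -p 2049
    check_tcp -H s2 -p 111
    # disk usage via SNMP (UCD-SNMP dskPercent for the first monitored disk, as an example)
    check_snmp -H s2 -C public -o .1.3.6.1.4.1.2021.9.1.9.1 -w 80 -c 90

For the notification question, Nagios escalation objects are keyed on notification number, which amounts to elapsed time once notification_interval is fixed, so the elapsed-time option is straightforward to configure.
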
  • attach the UPSes to s2 and s3 to enable graceful shutdown in the event of a power outage?
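
One way to do this (not decided; NUT is just one candidate, and "myups" is a placeholder name): run the NUT driver and upsd wherever the UPS's serial/USB cable lands, and upsmon on s2/s3 so they shut themselves down cleanly on low battery:

    # confirm the UPS is reachable before wiring up upsmon
    upsc myups@localhost ups.status
    # upsmon's MONITOR and SHUTDOWNCMD settings then drive the graceful shutdown
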
  • automatic snapshots for the ZFS datasets on s2 and s3 (see http://blogs.sun.com/timf/resource/README.zfs-auto-snapshot.txt)
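
Per the README linked above, the service hangs off SMF plus a ZFS user property; roughly (the dataset name is a placeholder):

    # opt a dataset in to automatic snapshots
    zfs set com.sun:auto-snapshot=true pool2/data
    # turn on one of the schedules
    svcadm enable svc:/system/filesystem/zfs/auto-snapshot:daily
    # the service's snapshots then appear here
    zfs list -t snapshot
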
  • more scripts (sketch follows this item):
    • migrate sequencing runs from temp to staging (currently /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/software/archive/move_*_temp_to_staging)
      • perhaps notify by email automatically when there are finished runs ready to be moved?
      • notify by email when this is done so interested parties know it has happened, and include the paths to the new runs
      • this should call a script to update symlinks and release the data.schuster_lab volume
    • script to better handle submitting illumina jobs to cluster, with email notifications
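
A minimal sketch of the move-and-notify wrapper; every path, the recipient address, and the update_symlinks.sh helper are placeholders (the real scripts live under the archive/ path above):

    #!/bin/sh
    TEMP=/path/to/temp
    STAGING=/path/to/staging
    moved=""
    for run in "$TEMP"/*; do
        [ -d "$run" ] || continue
        mv "$run" "$STAGING"/ && moved="$moved $STAGING/`basename "$run"`"
    done
    if [ -n "$moved" ]; then
        update_symlinks.sh                    # placeholder helper
        vos release data.schuster_lab         # push the RW volume out to its RO clones
        echo "New runs in staging:$moved" | mailx -s "runs moved to staging" someone@bx.psu.edu
    fi
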
  • Migrate linne to bx network.
    • install AFS client on all nodes
    • finish syncing UIDs/GIDs to match what is in BX LDAP
    • create BX accounts for those that don't already have them (cleanup/disable linne accounts that are no longer necessary for security reasons?)
    • point all nodes to ldap.bx.psu.edu for authZ, and switch to the BX.PSU.EDU krb5 realm for authN (verification sketch after this item)
    • disable the services running on linne that are no longer necessary
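
Once a node is repointed, a quick sanity check of authN, AFS tokens, and authZ lookups might look like this ("someuser" is a placeholder):

    # authN: get a ticket from the BX.PSU.EDU realm, then an AFS token for the bx.psu.edu cell
    kinit someuser@BX.PSU.EDU
    aklog -c bx.psu.edu -k BX.PSU.EDU
    tokens
    # authZ: confirm the synced UID/GID comes back from BX LDAP
    getent passwd someuser
    id someuser
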
  • Migrate Schuster Lab machines to bx network.
    • Or at least install the AFS client and set up the BX.PSU.EDU krb5 realm to handle authentication, so it's easier for the Schuster Lab machines to work with everyone else
  • Automate the archiving of sequencing run directories (cron sketch after this item).
    • Maybe after two weeks in staging they're moved into the archive?
    • need to keep symlinks up-to-date and release the data.schuster_lab volume appropriately!
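
A hedged, cron-able sketch of the two-week rule; the paths, the 14-day cutoff, and update_symlinks.sh are all assumptions pending a decision:

    # archive run directories untouched in staging for 14+ days
    find /path/to/staging/. ! -name . -prune -type d -mtime +14 |
    while read run; do
        mv "$run" /path/to/archive/
    done
    update_symlinks.sh                        # placeholder: same helper as the staging move
    vos release data.schuster_lab
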
  • Combine linne and persephone clusters
    • dependent on finishing linne-to-bx migration
    • master/slave SGE qmasters running somewhere central and more reliable (currently c1.persephone and linne)
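
SGE supports this natively: the shadow_masters file lists the primary qmaster on the first line and backup hosts after it, and each backup runs sge_shadowd so it can take over if the master disappears (hostnames below are placeholders):

    # $SGE_ROOT/$SGE_CELL/common/shadow_masters might read:
    #   central1    <- primary qmaster
    #   central2    <- shadow host, runs sge_shadowd
    cat "$SGE_ROOT/$SGE_CELL/common/shadow_masters"
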
  • tsm backups of s2 and s3?
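
If we do: once the servers are registered with a TSM server, the client's dsmc handles both one-off and scheduled incrementals (the filesystem argument is a placeholder):

    # one-off incremental backup
    dsmc incremental /pool2
    # ongoing backups run from the client scheduler
    dsmc schedule
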
  • Replace BioTeam iNquiry
    • Use Galaxy instead?
  • Implement a centralized database of sequencing run information (filesystem-scan sketch after this item).
    • maybe generate this based on the filesystem layout and the presence/absence of certain files?
    • maybe use this for generating notifications so people know when certain parts of the pipeline are done?
    • Basically a small LIMS.
    • Maybe integrate with Galaxy
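
A sketch of the filesystem-scan idea; the run directory and the marker filenames are hypothetical stand-ins for whatever the pipeline actually drops:

    # derive each run's pipeline state from the presence/absence of marker files
    for run in /path/to/runs/*; do
        [ -d "$run" ] || continue
        if [ -e "$run/analysis.done" ]; then state="analyzed"
        elif [ -e "$run/run.finished" ]; then state="sequenced"
        else state="running"
        fi
        name=`basename "$run"`
        printf '%s\t%s\n' "$name" "$state"
    done

Output like this could seed the run database and drive the "pipeline stage done" notifications mentioned above.
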
  • After problems with md1k-2 are fixed, turn on automated scrubbing.
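
If the scrubbing in question is ZFS-level, it's just a periodic zpool scrub; e.g. a root crontab entry (pool name and schedule are placeholders):

    # scrub every Sunday at 03:00
    0 3 * * 0 /usr/sbin/zpool scrub pool2
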
  • clean up old files in /afs/bx.psu.edu/depot/data/schuster_lab/old_stuff_to_cleanup