SLab:Todo

Latest revision as of 13:47, 3 June 2010

= in progress =

* [[slab:md1k-2 disk problems|md1k-2 disk problems]] - currently waiting for problems to show up again to get fresh RAID controller log entries, and to verify that switching to the spare EMM (array controller) didn't fix the problem
* nagios monitoring (http://kaylee.bx.psu.edu/nagios [login as guest/guest])
** who gets notified? everyone all at once, or use elapsed-time-based escalations?
** what do we want to monitor?
*** '''(DONE)''' samba share status on s2 so illumina-ga can copy data!
*** up/down state for all nodes and servers
*** '''(DONE)''' nfs server on s2/s3, and '''(PARTIAL)''' all the related tcp/udp ports necessary for proper nfs operation (see the port-check sketch after this list)
*** disk usage via snmp for s2+s3
*** '''(DONE)''' fault management via FMD over SNMP, like we do for afs-fs{4..7}, thumper, saturn, ...
*** SGE queue status
*** sequencer up/down state, and maybe disk usage (can we do that for illumina-ga remotely somehow?)
* Migrate linne to the bx network. See [[Slab:Linne_BX_migration]]
** install the AFS client on all nodes
** '''(PARTIAL)''' finish syncing UIDs/GIDs to match what is in BX LDAP
** create BX accounts for those that don't already have them (clean up/disable linne accounts that are no longer necessary, for security reasons?)
** point all nodes to ldap.bx.psu.edu for authZ, and switch to the BX.PSU.EDU krb5 realm for authN
** disable the services running on linne that are no longer necessary
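
A minimal sketch of the nfs port check mentioned in the list above, written as a Nagios-style plugin in Python (exit status 0 = OK, 2 = CRITICAL). The default ports are an assumption: 111 (portmapper) and 2049 (nfsd) are the predictable ones, while mountd, statd, and lockd land on varying ports unless they are pinned in the server config, so the real list for s2/s3 still needs to be filled in.

<pre>
#!/usr/bin/env python
# Nagios-style check: verify that NFS-related TCP ports answer on a host.
# Usage: check_nfs_ports.py <host> [port ...]
import socket
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # standard Nagios plugin exit codes
DEFAULT_PORTS = [111, 2049]                   # portmapper, nfsd (assumed; add pinned mountd/statd/lockd ports)


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except (socket.error, socket.timeout):
        return False


def main(argv):
    if len(argv) < 2:
        print("UNKNOWN: usage: check_nfs_ports.py <host> [port ...]")
        return UNKNOWN
    host = argv[1]
    ports = [int(p) for p in argv[2:]] or DEFAULT_PORTS
    closed = [p for p in ports if not port_open(host, p)]
    if closed:
        print("CRITICAL: {0} not answering on port(s) {1}".format(host, closed))
        return CRITICAL
    print("OK: {0} answering on port(s) {1}".format(host, ports))
    return OK


if __name__ == "__main__":
    sys.exit(main(sys.argv))
</pre>

Hooked up as a Nagios command and attached to s2 and s3 as a service, one plugin along these lines would cover the whole nfs-ports item.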

= queued =

* attach the UPSes to s2 and s3 to enable graceful shutdown in the event of a power outage?
* automatic snapshots for the ZFS datasets on s2 and s3 (see http://blogs.sun.com/timf/resource/README.zfs-auto-snapshot.txt; a cron-driven sketch follows this list)
* more scripts:
** migrate sequencing runs from temp to staging (currently /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/software/archive/move_*_temp_to_staging)
*** perhaps notify by email automatically when there are finished runs ready to be moved?
*** notify by email when this is done so any interested parties will see that it has been done, and provide paths to the new runs
*** this should call a script to update symlinks and release the data.schuster_lab volume
** script to better handle submitting illumina jobs to the cluster, with email notifications (see the submission sketch after this list)
** script to allow rsync'ing individual lanes from a run, given a source directory, dest directory, and lane(s) (see the lane-copy sketch after this list)
* Migrate Schuster Lab machines to the bx network.
** Or at least, install the AFS client and set up the BX.PSU.EDU krb5 realm to handle authentication, so it's easier for the Schuster Lab machines to work with everyone else
* Automate the archiving of sequencing run directories.
** Maybe after two weeks in staging they're moved into the archive?
** need to keep symlinks up-to-date and release the data.schuster_lab volume appropriately!
* Combine the linne and persephone clusters
** dependent on finishing the linne-to-bx migration
** master/slave SGE qmasters running somewhere central and more reliable (currently c1.persephone and linne)
* tsm backups of s2 and s3?
* Replace BioTeam iNquiry
** Use Galaxy instead?
* Implement a centralized database of sequencing run information.
** maybe generate this based on the filesystem layout and the presence/absence of certain files? (see the run-scan sketch after this list)
** maybe use this for generating notifications so people know when certain parts of the pipeline are done?
** Basically a small LIMS.
** Maybe integrate with Galaxy
* After the problems with md1k-2 are fixed, turn on automated scrubbing.
* clean up old files in /afs/bx.psu.edu/depot/data/schuster_lab/old_stuff_to_cleanup
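
For the ZFS snapshot item, the zfs-auto-snapshot service linked above is probably the real answer; the sketch below is only the cron-job version of the same idea. The dataset names and the keep-the-last-14 retention are placeholders, not the actual s2/s3 layout.

<pre>
#!/usr/bin/env python
# Cron-driven ZFS snapshot sketch: take a dated snapshot of each dataset,
# then prune this script's old snapshots beyond the retention count.
import subprocess
import time

DATASETS = ["pool0/sequencing"]   # placeholder: real s2/s3 dataset names go here
KEEP = 14                         # snapshots to retain per dataset
PREFIX = "auto"


def snapshots(dataset):
    """Return this script's snapshots of a dataset, oldest first."""
    out = subprocess.check_output(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name", "-s", "creation"])
    names = out.decode().split()
    return [n for n in names if n.startswith(dataset + "@" + PREFIX + "-")]


def main():
    stamp = time.strftime("%Y%m%d-%H%M")
    for dataset in DATASETS:
        # take a new dated snapshot, e.g. pool0/sequencing@auto-20100603-1347
        subprocess.check_call(
            ["zfs", "snapshot", "{0}@{1}-{2}".format(dataset, PREFIX, stamp)])
        # destroy everything older than the last KEEP snapshots
        for snap in snapshots(dataset)[:-KEEP]:
            subprocess.check_call(["zfs", "destroy", snap])


if __name__ == "__main__":
    main()
</pre>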
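
A sketch of the per-lane rsync script from the "more scripts" item, assuming the usual s_&lt;lane&gt;_* file naming inside a run directory; if the pipeline lays files out differently, the filter pattern needs to change.

<pre>
#!/usr/bin/env python
# Sketch: copy only the requested lanes out of a run directory with rsync.
# Usage: rsync_lanes.py <source_run_dir> <dest_dir> <lane> [<lane> ...]
# Assumes per-lane files are named s_<lane>_*; adjust the filter if not.
import subprocess
import sys


def rsync_lanes(source, dest, lanes):
    """Copy files for the given lanes, preserving the directory layout."""
    cmd = ["rsync", "-av", "--prune-empty-dirs", "--include=*/"]
    for lane in lanes:
        cmd.append("--include=s_{0}_*".format(lane))   # per-lane file pattern (assumed)
    cmd += ["--exclude=*", source.rstrip("/") + "/", dest]
    subprocess.check_call(cmd)


if __name__ == "__main__":
    if len(sys.argv) < 4:
        sys.exit("usage: rsync_lanes.py <source_run_dir> <dest_dir> <lane> [<lane> ...]")
    rsync_lanes(sys.argv[1], sys.argv[2], sys.argv[3:])
</pre>

The filter ordering is what makes this work: directories are always included so rsync can descend, the lane files are whitelisted, everything else is excluded, and --prune-empty-dirs drops directories that end up empty on the destination.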
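
A sketch of the illumina job submission wrapper, leaning on SGE's own mail flags (-m/-M) for the end-of-job notification. The notification address is a placeholder, and real jobs would also want queue and resource options added to the qsub line.

<pre>
#!/usr/bin/env python
# Sketch: submit an Illumina analysis command to SGE and mail when it finishes.
# Usage: submit_illumina.py <job_name> <command> [args ...]
import subprocess
import sys

NOTIFY = "notify@example.edu"     # placeholder notification address


def submit(job_name, command):
    """Submit `command` (a list) to SGE, mailing NOTIFY when the job ends or aborts."""
    qsub = ["qsub",
            "-N", job_name,   # job name shown in qstat and in the notification mail
            "-cwd",           # run in the current working directory
            "-b", "y",        # submit the command directly instead of copying a script
            "-m", "ea",       # mail on job end and on abort
            "-M", NOTIFY]
    subprocess.check_call(qsub + command)


if __name__ == "__main__":
    if len(sys.argv) < 3:
        sys.exit("usage: submit_illumina.py <job_name> <command> [args ...]")
    submit(sys.argv[1], sys.argv[2:])
</pre>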
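
A sketch of the filesystem-scan idea behind the small-LIMS item: walk the staging area and infer each run's stage from which marker files exist. The staging path and the marker file names are placeholders; the real pipeline's sentinel files would go in the MARKERS list.

<pre>
#!/usr/bin/env python
# Sketch: derive a table of sequencing runs and their pipeline stage from
# the filesystem layout. Paths and marker file names are placeholders.
import csv
import os
import sys

STAGING = "/path/to/staging"      # placeholder: where run directories live
MARKERS = [                       # placeholder marker files, in pipeline order
    ("RunInfo.xml", "run started"),
    ("analysis.finished", "analysis done"),
]


def scan(staging):
    """Yield (run_name, stage) for every run directory under staging."""
    for run in sorted(os.listdir(staging)):
        run_dir = os.path.join(staging, run)
        if not os.path.isdir(run_dir):
            continue
        stage = "unknown"
        for marker, label in MARKERS:
            if os.path.exists(os.path.join(run_dir, marker)):
                stage = label     # later markers override earlier ones
        yield run, stage


if __name__ == "__main__":
    writer = csv.writer(sys.stdout)
    writer.writerow(["run", "stage"])
    for row in scan(STAGING):
        writer.writerow(row)
</pre>

The same scan could feed the email notifications mentioned above, or be loaded into Galaxy or a small database later.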