Latest revision as of 13:47, 3 June 2010

in progress

md1k-2 disk problems - currently waiting for problems to show up again to get fresh RAID controller log entries, and to verify that switching to the spare EMM (array controller) didn't fix the problem
nagios monitoring (http://kaylee.bx.psu.edu/nagios [login as guest/guest])
- who gets notified? everyone all at once, or use elapsed-time-based escalations?
- what do we want to monitor?
  - (DONE) samba share status on s2 so illumina-ga can copy data!
  - up/down state for all nodes and servers
  - (DONE) nfs server on s2/s3 and PARTIAL all the related tcp/udp ports necessary for proper nfs operation
  - disk usage via snmp for s2+s3
  - (DONE) fault management via FMD over SNMP like we do for afs-fs{4..7}, thumper, saturn....
  - SGE queue status
  - sequencer up/down state, and maybe disk usage (can we do that for illumina-ga remotely somehow?)
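A minimal sketch of how the disk-usage check could map onto the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name and the 80/90 percent thresholds are assumptions for illustration, and the sketch takes an already-fetched percentage rather than doing the SNMP query itself.

```python
# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_disk_usage(used_pct, warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for a disk usage percentage.

    warn/crit are hypothetical thresholds, not values taken from this page.
    """
    # Treat a missing or out-of-range reading as UNKNOWN rather than OK.
    if used_pct is None or not (0.0 <= used_pct <= 100.0):
        return UNKNOWN, "DISK UNKNOWN - no usable reading"
    if used_pct >= crit:
        return CRITICAL, "DISK CRITICAL - %.1f%% used" % used_pct
    if used_pct >= warn:
        return WARNING, "DISK WARNING - %.1f%% used" % used_pct
    return OK, "DISK OK - %.1f%% used" % used_pct
```

A wrapper script would print the status line and exit with the returned code, which is what Nagios reads to decide whether to notify.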
Migrate linne to bx network. See Slab:Linne_BX_migration
- install AFS client on all nodes
- (PARTIAL) finish syncing UIDs/GIDs to match what is in BX LDAP
- create BX accounts for those that don't already have them (for security reasons, also clean up/disable linne accounts that are no longer needed?)
- point all nodes to ldap.bx.psu.edu for authZ, and switch to the BX.PSU.EDU krb5 realm for authN
- disable the services running on linne that are no longer necessary

queued

attach the UPSes to s2 and s3 to enable graceful shutdown in the event of a power outage?
automatic snapshots for the ZFS datasets on s2 and s3 (see http://blogs.sun.com/timf/resource/README.zfs-auto-snapshot.txt)
more scripts:
- migrate sequencing runs from temp to staging (currently /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/software/archive/move_*_temp_to_staging)
  - perhaps notify by email automatically when there are finished runs ready to be moved?
  - notify by email when this is done so any interested parties will see that it has been done, and provide paths to new runs
  - this should call a script to update symlinks and release the data.schuster_lab volume
- script to better handle submitting illumina jobs to cluster, with email notifications
- script to allow rsync'ing individual lanes from a run, given source directory, dest directory, and lane(s)
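The per-lane rsync script could be sketched roughly as below. It only builds the rsync argument list from a source directory, destination directory, and lane numbers. The function name is hypothetical, and the assumption that per-lane files start with "s_<lane>_" would need checking against the actual run directory layout.

```python
def build_lane_rsync_cmd(src, dest, lanes):
    """Build an rsync argv that copies only files for the given lanes.

    Assumes (hypothetically) that per-lane file names start with "s_<lane>_".
    """
    if not lanes:
        raise ValueError("at least one lane is required")
    # "--include=*/" lets rsync descend into subdirectories;
    # "--prune-empty-dirs" drops directories that end up with no matches.
    cmd = ["rsync", "-a", "--prune-empty-dirs", "--include=*/"]
    for lane in sorted(set(int(lane) for lane in lanes)):
        cmd.append("--include=s_%d_*" % lane)
    # Everything not explicitly included above is skipped.
    cmd.append("--exclude=*")
    # Trailing slash on the source copies its contents, not the directory itself.
    cmd.extend([src.rstrip("/") + "/", dest])
    return cmd
```

The resulting list can be handed to `subprocess.call` as-is, which avoids shell quoting problems with the wildcard patterns.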
Migrate Schuster Lab machines to bx network.
- Or at least, install AFS client and set up the BX.PSU.EDU krb5 realm to handle authentication so it's easier for the schuster lab machines to work with everyone else
Automate the archiving of sequencing run directories.
- Maybe after two weeks in staging they're moved into the archive?
- need to keep symlinks up-to-date and release the data.schuster_lab volume appropriately!
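One way the two-week rule could be sketched: collect a modification time per run directory in staging, then select the ones past the cutoff. The function names are hypothetical, and using the directory mtime as the "arrived in staging" time is an assumption; a real script might record the move time explicitly. Updating symlinks and releasing the data.schuster_lab volume are left out.

```python
import os

TWO_WEEKS = 14 * 24 * 3600  # seconds; the two-week cutoff floated above


def runs_due_for_archive(mtimes, now, max_age=TWO_WEEKS):
    """Given {run_name: mtime_seconds}, return the names older than max_age, sorted."""
    return sorted(name for name, mtime in mtimes.items() if now - mtime > max_age)


def scan_staging(staging_dir):
    """Map each run directory directly under staging_dir to its modification time."""
    mtimes = {}
    for name in os.listdir(staging_dir):
        path = os.path.join(staging_dir, name)
        if os.path.isdir(path):
            mtimes[name] = os.path.getmtime(path)
    return mtimes
```

A cron job could call `runs_due_for_archive(scan_staging(path), time.time())` and pass the result to the move script.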
Combine linne and persephone clusters
- dependent on finishing linne-to-bx migration
- master/slave SGE qmasters running somewhere central and more reliable (currently c1.persephone and linne)
TSM backups of s2 and s3?
Replace BioTeam iNquiry
- Use Galaxy instead?
Implement a centralized database of sequencing run information.
- maybe generate this based on the filesystem layout and the presence/absence of certain files?
- maybe use this for generating notifications so people know when certain parts of the pipeline are done?
- Basically a small LIMS.
- Maybe integrate with Galaxy
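A rough sketch of deriving run status from the presence or absence of marker files. The stage names and marker file names here are made up for illustration; the real pipeline's files would have to be substituted.

```python
# Hypothetical marker files, in pipeline order; the real names would differ.
STAGE_MARKERS = [
    ("image_analysis", "image_analysis.done"),
    ("base_calling", "base_calling.done"),
    ("alignment", "alignment.done"),
]


def run_status(files_present, markers=STAGE_MARKERS):
    """Return {stage: bool} from the file names found in a run directory."""
    present = set(files_present)
    return dict((stage, marker in present) for stage, marker in markers)


def last_completed_stage(files_present, markers=STAGE_MARKERS):
    """Return the last stage whose marker and all earlier markers exist, else None."""
    present = set(files_present)
    last = None
    for stage, marker in markers:
        if marker not in present:
            break
        last = stage
    return last
```

Feeding `os.listdir(run_dir)` into these functions for each run would give one row per run, and a change in `last_completed_stage` between scans is a natural trigger for a notification.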
After problems with md1k-2 are fixed, turn on automated scrubbing.
clean up old files in /afs/bx.psu.edu/depot/data/schuster_lab/old_stuff_to_cleanup

@@ Line 1: / Line 1: @@
 = in progress =
 * [[slab:md1k-2 disk problems|md1k-2 disk problems]] - currently waiting for problems to show up again to get fresh RAID controller log entries, and to verify that switching to the spare EMM (array controller) didn't fix the problem
+* nagios monitoring (http://kaylee.bx.psu.edu/nagios  [login as guest/guest])
+** who gets notified? everyone all at once, or use elapsed-time-based escalations?
+** what do we want to monitor?
+*** '''(DONE)''' samba share status on s2 so illumina-ga can copy data!
+*** up/down state for all nodes and servers
+*** '''(DONE)''' nfs server on s2/s3 and '''PARTIAL''' all the related tcp/udp ports necessary for proper nfs operation
+*** disk usage via snmp for s2+s3
+*** '''(DONE)''' fault management via FMD over SNMP like we do for afs-fs{4..7}, thumper, saturn....
+*** SGE queue status
+*** sequencer up/down state, and maybe disk usage (can we do that for illumina-ga remotely somehow?)
+* Migrate linne to bx network. See [[Slab:Linne_BX_migration]]
+** install AFS client on all nodes
+** '''(PARTIAL)''' finish sync'ing uid's/gid's to match what is in BX LDAP
+** create BX accounts for those that don't already have them (cleanup/disable linne accounts that are no longer necessary for security reasons?)
+** point all nodes to ldap.bx.psu.edu for authZ, and switch to the BX.PSU.EDU krb5 realm for authN
+** disable the services running on linne that are no longer necessary
 = queued =
+* attach the UPSes to s2 and s3 to enable graceful shutdown in the event of a power outage?
 * automatic snapshots for the ZFS datasets on s2 and s3 (see http://blogs.sun.com/timf/resource/README.zfs-auto-snapshot.txt)
 * more scripts:
@@ Line 10: / Line 27: @@
 *** this should call a script to update symlinks and release the data.schuster_lab volume
 ** script to better handle submitting illumina jobs to cluster, with email notifications
-* Migrate linne to bx network.
+** script to allow rsync'ing individual lanes from a run, given source directory, dest directory, and lane(s)
 * Migrate Schuster Lab machines to bx network.
-** Or at least, install AFS client.
+** Or at least, install AFS client and set up the BX.PSU.EDU krb5 realm to handle authentication so it's easier for the schuster lab machines to work with everyone else
 * Automate the archiving of sequencing run directories.
 ** Maybe after two weeks in staging they're moved into the archive?
-* Combine linne and persephone clusters - dependent on finishing linne-to-bx migration
+** need to keep symlinks up-to-date and release the data.schuster_lab volume appropriately!
+* Combine linne and persephone clusters
+** dependent on finishing linne-to-bx migration
+** master/slave SGE qmasters running somewhere central and more reliable (currently c1.persephone and linne)
 * tsm backups of s2 and s3?
 * Replace BioTeam iNquiry
 ** Use Galaxy instead?
 * Implement a centralized database of sequencing run information.
+** maybe generate this based on the filesystem layout and the presence/absence of certain files?
+** maybe use this for generating notifications so people know when certain parts of the pipeline are done?
 ** Basically a small LIMS.
 ** Maybe integrate with galaxy
 * After problems with md1k-2 are fixed, turn on automated scrubbing.
 * clean up old files in /afs/bx.psu.edu/depot/data/schuster_lab/old_stuff_to_cleanup

Difference between revisions of "SLab:Todo"
