Difference between revisions of "SLab:Todo"
From CCGB
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− |
Latest revision as of 13:47, 3 June 2010
in progress
- md1k-2 disk problems - currently waiting for problems to show up again to get fresh RAID controller log entries, and to verify that switching to the spare EMM (array controller) didn't fix the problem
- nagios monitoring (http://kaylee.bx.psu.edu/nagios [login as guest/guest])
  - who gets notified? everyone all at once, or use elapsed-time-based escalations?
  - what do we want to monitor?
    - (DONE) samba share status on s2 so illumina-ga can copy data!
    - up/down state for all nodes and servers
    - (DONE) nfs server on s2/s3; (PARTIAL) all the related tcp/udp ports necessary for proper nfs operation
    - disk usage via snmp for s2+s3
    - (DONE) fault management via FMD over SNMP like we do for afs-fs{4..7}, thumper, saturn....
    - SGE queue status
    - sequencer up/down state, and maybe disk usage (can we do that for illumina-ga remotely somehow?)
- Migrate linne to bx network. See Slab:Linne_BX_migration
  - install AFS client on all nodes
  - (PARTIAL) finish syncing UIDs/GIDs to match what is in BX LDAP
  - create BX accounts for those that don't already have them (clean up/disable linne accounts that are no longer necessary, for security reasons?)
  - point all nodes to ldap.bx.psu.edu for authZ, and switch to the BX.PSU.EDU krb5 realm for authN
  - disable the services running on linne that are no longer necessary
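
For the still-partial NFS port monitoring item above, a check could look roughly like the sketch below. It only covers TCP, and the port list (111 for rpcbind, 2049 for nfsd) is an assumption about what s2/s3 expose; mountd, lockd and statd often sit on dynamically assigned ports and are not handled here. The function names are made up for illustration. The return codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed port list: 111 is rpcbind/portmapper and 2049 is nfsd. Other NFS
# helpers (mountd, lockd, statd) often use dynamically assigned ports, so
# they are left out of this sketch.
NFS_TCP_PORTS = (111, 2049)


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_tcp_ports(host, ports=NFS_TCP_PORTS, timeout=3.0):
    """Return (exit_code, message) in the form a Nagios plugin would report."""
    closed = [p for p in ports if not tcp_port_open(host, p, timeout)]
    if closed:
        return CRITICAL, "CRITICAL: %s unreachable TCP ports: %s" % (
            host, ", ".join(str(p) for p in closed))
    return OK, "OK: %s all TCP ports reachable: %s" % (
        host, ", ".join(str(p) for p in ports))
```

A thin wrapper script would print the message and call `sys.exit()` with the returned code so Nagios can pick up the state. UDP services would need a different probe, since a plain connect does not tell us whether anything is listening.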
queued
- attach the UPSes to s2 and s3 to enable graceful shutdown in the event of a power outage?
- automatic snapshots for the ZFS datasets on s2 and s3 (see http://blogs.sun.com/timf/resource/README.zfs-auto-snapshot.txt)
- more scripts:
  - migrate sequencing runs from temp to staging (currently /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/software/archive/move_*_temp_to_staging)
    - perhaps notify by email automatically when there are finished runs ready to be moved?
    - notify by email when this is done so any interested parties will see that it has been done, and provide paths to new runs
    - this should call a script to update symlinks and release the data.schuster_lab volume
  - script to better handle submitting illumina jobs to cluster, with email notifications
  - script to allow rsync'ing individual lanes from a run, given source directory, dest directory, and lane(s)
- Migrate Schuster Lab machines to bx network.
  - Or at least, install AFS client and set up the BX.PSU.EDU krb5 realm to handle authentication, so it's easier for the Schuster Lab machines to work with everyone else
- Automate the archiving of sequencing run directories.
  - Maybe after two weeks in staging they're moved into the archive?
  - need to keep symlinks up-to-date and release the data.schuster_lab volume appropriately!
- Combine linne and persephone clusters
  - dependent on finishing linne-to-bx migration
  - master/slave SGE qmasters running somewhere central and more reliable (currently c1.persephone and linne)
- tsm backups of s2 and s3?
- Replace BioTeam iNquiry
  - Use Galaxy instead?
- Implement a centralized database of sequencing run information.
  - maybe generate this based on the filesystem layout and the presence/absence of certain files?
  - maybe use this for generating notifications so people know when certain parts of the pipeline are done?
  - Basically a small LIMS.
  - Maybe integrate with Galaxy
- After problems with md1k-2 are fixed, turn on automated scrubbing.
- clean up old files in /afs/bx.psu.edu/depot/data/schuster_lab/old_stuff_to_cleanup
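
For the per-lane rsync script mentioned in the list above (source directory, dest directory, lane(s)), one possible starting point is to build the rsync filter rules from the lane numbers. The sketch below assumes lane files are named `s_<lane>_*`; that pattern and the function name are assumptions for illustration, not something the existing scripts are known to use.

```python
def build_lane_rsync_command(source_dir, dest_dir, lanes):
    """Build an rsync argument list that copies only the given lanes.

    Assumes lane files are named s_<lane>_* (hypothetical pattern; adjust it
    to whatever the pipeline actually writes).
    """
    lanes = sorted(set(int(lane) for lane in lanes))
    if not lanes:
        raise ValueError("at least one lane is required")
    # Include every directory so rsync can recurse, then only the lane files;
    # --prune-empty-dirs drops directories that end up with nothing in them.
    cmd = ["rsync", "-a", "--prune-empty-dirs", "--include=*/"]
    cmd += ["--include=s_%d_*" % lane for lane in lanes]
    cmd.append("--exclude=*")
    # Trailing slash on the source copies its contents, not the directory itself.
    cmd += [source_dir.rstrip("/") + "/", dest_dir.rstrip("/") + "/"]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Adding `--dry-run` to the list first is a cheap way to see what would be copied before touching the destination.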
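
For the "maybe after two weeks in staging" archiving idea in the list above, the selection step could be sketched as follows. It uses each run directory's modification time as a stand-in for how long the run has been in staging, which is an assumption; a timestamp recorded at move time would be more reliable. The function name and the default threshold are illustrative only.

```python
import os
import time

TWO_WEEKS = 14 * 24 * 60 * 60  # seconds


def runs_ready_for_archive(staging_dir, max_age=TWO_WEEKS, now=None):
    """Return run directories in staging_dir whose mtime is older than max_age.

    Directory mtime is only an approximation of "time spent in staging".
    """
    now = time.time() if now is None else now
    ready = []
    for entry in sorted(os.scandir(staging_dir), key=lambda e: e.name):
        # Skip plain files and symlinks; only real directories count as runs.
        if entry.is_dir(follow_symlinks=False):
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > max_age:
                ready.append(entry.path)
    return ready
```

This only lists candidates. The actual move, the symlink update and the release of the data.schuster_lab volume (presumably via `vos release`) would still be separate steps, and are left out here on purpose.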