SLab:Run Processing


Give the persephone cluster write access to the output directory

Most of the run processing scripts write standard output and standard error into the $HOME/sge-out directory of the person submitting the job. The persephone cluster needs to be given permission to write into this directory. This only needs to be done once, not for every run. The following commands configure the necessary permissions:

% mkdir $HOME/sge-out
% fs setacl -dir $HOME -acl svc/sge/persephone l
% fs setacl -dir $HOME/sge-out -acl svc/sge/persephone rliw
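
To confirm the ACLs are in place, the entries on each directory can be listed with fs listacl:

% fs listacl $HOME
% fs listacl $HOME/sge-out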

454

rigs
  • schuster-flx1
  • schuster-flx2
  • schuster-flx3
  • schuster-flx4

on-rig processing

  • run directories are stored in /data
    • /data/YYYY_MM_DD/R_YYYY_MM_DD_HH_MM_SS_RIGNAME_OPERATOR_RUNNAME
  • when a run finishes processing, it calls the /usr/local/rig/bin/postAnalysisScript.sh script (a rough sketch of this script appears after this list)
    • rsyncs the run directory to s2:/zfs/md1k-4/data/sequencing/temp/454
    • sshes to c1.persephone to submit the processing job
      • depending on the run
        • calls c1.persephone:/usr/local/bin/submit-signalProcessing.sh
        • calls c1.persephone:/usr/local/bin/submit-fullProcessing.sh
    • status email is sent to 454pipeline@bx.psu.edu
    • our postAnalysisScript.sh is kept in the /home/adminrig/postAnalysisScript directory on each rig
      • revision controlled using rcs
        •  % co -l postAnalysisScript.sh
        •  % vi postAnalysisScript.sh
        •  % ci -u postAnalysisScript.sh
      • Makefile in this directory installs our version into /usr/local/rig/bin
        •  % make install
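
For orientation, here is a rough sketch of the shape of this post-analysis hook. It assumes the rsync destination, submit host, and mail address listed above; the actual script kept under RCS on the rigs is authoritative, and the run directory and the choice of submit script really come from the rig software:

#!/bin/sh
# Sketch only -- see /home/adminrig/postAnalysisScript/postAnalysisScript.sh for the real version.
RUN_DIR=$1                                  # e.g. /data/YYYY_MM_DD/R_..._RIGNAME_OPERATOR_RUNNAME
RUN_NAME=`basename "$RUN_DIR"`

# copy the finished run off the rig
rsync -a "$RUN_DIR" s2:/zfs/md1k-4/data/sequencing/temp/454/

# hand the run to the cluster (submit-fullProcessing.sh is used instead for runs that need it)
ssh c1.persephone /usr/local/bin/submit-signalProcessing.sh "$RUN_NAME"

# report status to the pipeline list
echo "$RUN_NAME copied and submitted" | mail -s "454 run $RUN_NAME" 454pipeline@bx.psu.edu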

signal processing

  • signal processing for runs is performed on the persephone cluster
  • depending on the run
    • uses qsub to submit a job using c1.persephone:/usr/local/bin/signalProcessing.qsub
    • uses qsub to submit a job using c1.persephone:/usr/local/bin/fullProcessing.qsub
  • status email is sent to 454pipeline@bx.psu.edu

Before exiting, signal processing jobs mark that processing is done by touching a file with the same name as the run directory:

/afs/bx.psu.edu/depot/data/schuster_lab/sequencing/temp/454/.processing_finished/RUN_DIR_NAME
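
A minimal sketch of the tail end of such a qsub script, assuming the run directory name is available in $RUN_DIR_NAME (the real scripts are signalProcessing.qsub and fullProcessing.qsub, referenced above):

# ... signal/full processing steps above ...

# mark the run as finished so the staging cron job on s2 will pick it up
touch /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/temp/454/.processing_finished/"$RUN_DIR_NAME"

# report status to the pipeline list
echo "processing finished for $RUN_DIR_NAME" | mail -s "454 processing: $RUN_DIR_NAME" 454pipeline@bx.psu.edu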

staging

A cron job on s2 checks the /zfs/md1k-4/data/sequencing/temp/454/.processing_finished directory once a minute to see if any signal processing jobs have finished. When it finds a finished job, it moves the corresponding run to the staging directory:

/afs/bx.psu.edu/depot/data/schuster_lab/sequencing/staging/454/RUN_DIR

Once the run has been copied to the staging directory, the files in the run directory are modified as needed to make sure they have the correct owner, group, and permissions.
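
In outline, the cron job does something like the following sketch; the group name and permission bits here are placeholders, and the real job installed on s2 is authoritative:

#!/bin/sh
# Sketch of the s2 staging cron job (runs once a minute).
TEMP=/zfs/md1k-4/data/sequencing/temp/454
STAGING=/afs/bx.psu.edu/depot/data/schuster_lab/sequencing/staging/454

for marker in "$TEMP"/.processing_finished/*; do
    [ -e "$marker" ] || continue
    run=`basename "$marker"`
    mv "$TEMP/$run" "$STAGING/"
    # fix owner, group, and permissions on the staged copy (example values)
    chgrp -R schuster_lab "$STAGING/$run"
    chmod -R g+r "$STAGING/$run"
    rm "$marker"
done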

archive

To archive a run, it needs to be moved into one of the archive filesystems (md1k-1, md1k-2, or md1k-3 on s3; md1k-4, md1k-5, or md1k-6 on s2).

The /zfs/md1k-N/archive filesystem is compressed and exported read-only.

s3:/zfs/md1k-{1,2,3}/archive/sequencing/454/YYYY/YYYY_MM_DD/
s2:/zfs/md1k-{4,5,6}/archive/sequencing/454/YYYY/YYYY_MM_DD/

After the run has been archived, the links in the following directory need to be updated to point at the run's new location.

/afs/bx.psu.edu/depot/data/schuster_lab/sequencing/archive/454
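
As an example, moving a run into the md1k-4 archive and adding its link might look roughly like this. It assumes the staged copy is reachable on s2 under /zfs/md1k-4/data/sequencing/staging/454 (the archive filesystems are exported read-only, so the mv has to happen locally on s2 or s3), and the symlink target should be made to match the form of the existing links in the archive directory:

% ssh s2.persephone
s2% mv /zfs/md1k-4/data/sequencing/staging/454/RUN_DIR \
      /zfs/md1k-4/archive/sequencing/454/YYYY/YYYY_MM_DD/
s2% exit
% cd /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/archive/454
% ln -s /path/to/archived/RUN_DIR .        # match the target form of the existing links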

Illumina

systems
  • illumina-ga

on-system processing

Samba is running on s2.persephone so that the Illumina GA can copy its data directly to s2.

  • The Illumina GA copies its data to one of three locations (chosen by the operator)
    • \\s2.persephone\illumina-4, which is on s2.persephone:/zfs/md1k-4/data/illumina
    • \\s2.persephone\illumina-5, which is on s2.persephone:/zfs/md1k-5/data/illumina
    • \\s2.persephone\illumina-6, which is on s2.persephone:/zfs/md1k-6/data/illumina
  • Using the current software, both image analysis and base calling are performed on-system
    • SCS2.5/RTA1.5
    • SCS2.6/RTA1.6
  • Samba configuration file: /etc/sfw/smb.conf
  • To determine samba process state: svcs -xv samba
  • To restart samba: svcadm restart samba
  • To enable/disable samba: svcadm enable samba / svcadm disable samba
  • Logs are stored in /var/adm/samba/<ip>.log

Samba shares

Note: These were taken on 2010/05/12 and may not represent the current definitions.

[illumina-4]
   comment = Illumina
   path = /zfs/md1k-4/data/illumina
   public = yes
   writable = yes
   printable = no
   force group = illumina-data
   create mask = 0775
   directory mask = 0775

[illumina-5]
   comment = Illumina
   path = /zfs/md1k-5/data/illumina
   public = yes
   writable = yes
   printable = no
   force group = illumina-data
   create mask = 0775
   directory mask = 0775

[illumina-6]
   comment = Illumina
   path = /zfs/md1k-6/data/illumina
   public = yes
   writable = yes
   printable = no
   force group = illumina-data
   create mask = 0775
   directory mask = 0775
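
One way to sanity-check a share from a Unix host (assuming smbclient is installed; the shares are public, so no password is needed):

% smbclient //s2.persephone/illumina-4 -N -c 'ls'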

staging

After we receive an email from the operator informing us that a run has completed, we copy it to the staging directory using the following commands:

% ssh s1.persephone.bx.psu.edu
% /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/software/archive/move_illumina_temp_to_staging

signal processing

When runs come off of the Illumina GA, their images have already been processed and bases have already been called. Scripts for processing Illumina runs can be found here:

/afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/illumina/submit

The firecrest directory is for reprocessing images, the bustard directory is for recalling bases, and the gerald directory is for aligning reads to a reference genome. Each directory has a doit script with a sample invocation. The submit scripts should be run on c1.persephone.bx.psu.edu.

% ssh c1.persephone
c1% cd /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/illumina/submit/gerald
c1% ./doit

Before aligning reads to a reference genome (using GERALD), you need to create an appropriate GERALD configuration file. We've been placing these configuration files inside the run directories in a file called config.txt.

GERALD config file for 100318_HWUSI-EAS610_0009

1278:ELAND_GENOME /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/staging/illumina/reference/mm8
4:ELAND_GENOME /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/staging/illumina/reference/VaMs102
12478:ANALYSIS eland_extended
356:ANALYSIS sequence
USE_BASES Y36
QCAL_SOURCE upstream
ELAND_SET_SIZE 60
EMAIL_LIST illumina-pipeline@bx.psu.edu
EMAIL_SERVER smtp
EMAIL_DOMAIN bx.psu.edu
WEB_DIR_ROOT https://badger.bx.psu.edu/illumina

GERALD config file for 100211_HWUSI-EAS610_0005

USE_BASES Y76,Y76
1235678:ANALYSIS sequence_pair
4:ANALYSIS eland_pair
4:ELAND_GENOME /afs/bx.psu.edu/depot/data/schuster_lab/sequencing/support/illumina/reference/hg18
QCAL_SOURCE upstream
ELAND_SET_SIZE 60
EMAIL_LIST illumina-pipeline@bx.psu.edu
EMAIL_SERVER smtp
EMAIL_DOMAIN bx.psu.edu
WEB_DIR_ROOT https://badger.bx.psu.edu/illumina

For more examples see:

/afs/bx.psu.edu/depot/data/schuster_lab/sequencing/archive/illumina/flat/*/config.txt

archive

To archive a run, it needs to be moved into one of the archive filesystems (md1k-1, md1k-2, or md1k-3 on s3; md1k-4, md1k-5, or md1k-6 on s2).

The /zfs/md1k-N/archive filesystem is compressed and exported read-only.

s3:/zfs/md1k-{1,2,3}/archive/sequencing/illumina/YYYY/YYYY_MM_DD/
s2:/zfs/md1k-{4,5,6}/archive/sequencing/illumina/YYYY/YYYY_MM_DD/

After the run has been archived, the links in the following directory need to be updated to point at the run's new location.

/afs/bx.psu.edu/depot/data/schuster_lab/sequencing/archive/illumina