BX:SGE
Overview
There is one central SGE installation which handles job scheduling across all of the BX clusters, work servers, and workstations (with the exception of the okinawa, linne, and galaxy clusters). Merging existing clusters, work servers, and workstations is still a work-in-progress project.
The central grid engine has a pair of fully redundant master servers to ensure continuous job scheduling. The loss of both sge masters does not kill jobs that are currently running or queued, but will prevent any further job submissions. There is an approximately 5 minute failover period between sge master failure and the startup of the other sge master.
Status
- Current BX Grid load can be seen through GANGLIA at http://ganglia.bx.psu.edu
- A web version of qstat (XSL formatted version of qstat -f -u '*' -xml) is available at http://qstat.bx.psu.edu
Usage
To submit a job, put the command(s) into a script, and use qsub.
Various job resource requirements can be specified with -l resource=foo. Careful consideration should be given to your job's resource requirements. Specifying arch, mem_free, s_vmem, and slots for threaded jobs is essential to ensure your job does not over-subscribe on resources, and runs to completion. SGE cannot predict what resources your job requires, for example it cannot predict how much memory your job will require, so it might schedule it on a node that has far less memory than necessary, causing the node to wedge itself or the job to die.
SGE host status can be seen with qhost
Job queue/status can be seen with qstat -f, which will show just your jobs. To see everyone's jobs, qstat -f -u '*'. Note that qstat behaves different than previous versions of SGE.
For more detailed usage and examples, please see the SGE Documentation Site: SGE 6.2u5 documentation