Revision as of 16:04, 3 September 2010

Overview

There is one central SGE installation that handles job scheduling across all of the BX clusters, work servers, and workstations (with the exception of the okinawa, linne, and galaxy clusters). Merging the existing clusters, work servers, and workstations into it is still a work in progress.

The central grid engine has a pair of fully redundant master servers to ensure continuous job scheduling. If one SGE master fails, the other starts up after a failover period of approximately 5 minutes. Losing both SGE masters does not kill jobs that are currently running or queued, but it prevents any further job submissions.

Status

Current BX Grid load can be seen through GANGLIA at http://ganglia.bx.psu.edu
A web version of qstat (XSL formatted version of qstat -f -u '*' -xml) is available at http://qstat.bx.psu.edu

Usage

To submit a job, put the command(s) into a script, and use qsub.
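As a minimal sketch of the step above: the snippet below writes a small job script and submits it with qsub. The file name hello_job.sh and the job name are hypothetical, chosen only for illustration. The `#$` lines use standard qsub options (-N, -cwd, -j, -S); whether this site expects any of them is an assumption. The submission is skipped when qsub is not on the PATH.

```shell
#!/bin/sh
# Write a minimal job script. hello_job.sh is a hypothetical name.
cat > hello_job.sh <<'EOF'
#!/bin/sh
#$ -N hello_job
#$ -cwd
#$ -j y
#$ -S /bin/sh
echo "Running on $(hostname)"
EOF

# Submit only if qsub is available; otherwise show what would be run.
if command -v qsub >/dev/null 2>&1; then
    qsub hello_job.sh
else
    echo "qsub not found; would run: qsub hello_job.sh"
fi
```

Lines beginning with `#$` are read by qsub as embedded options, so the same settings could instead be passed on the qsub command line.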

Job resource requirements can be specified with -l resource=foo. Give careful consideration to your job's resource requirements. Specifying arch, mem_free, s_vmem, and (for threaded jobs) slots is essential to ensure that your job does not over-subscribe resources and that it runs to completion. SGE cannot predict what resources your job requires. For example, it cannot predict how much memory your job will need, so it might schedule the job on a node that has far less memory than necessary, causing the node to wedge itself or the job to die.
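The following sketch shows how such a request might be assembled for a threaded job. Every value in it is an assumption for illustration: the architecture string, the memory figure, the slot count, the script name my_job.sh, and in particular the parallel environment name "threaded", which is hypothetical (slots are normally requested through a parallel environment with -pe, and the names available are site-specific). The script only prints the command so that it can be reviewed before use.

```shell
#!/bin/sh
# All values are illustrative assumptions; adjust them to the job and the site.
ARCH="lx24-amd64"   # assumed architecture string; qhost lists the real ones
MEM="4G"            # assumed memory figure for mem_free and s_vmem
SLOTS=4             # assumed number of threads
PE="threaded"       # hypothetical parallel environment; see: qconf -spl

# Combine the resource requests into one comma-separated -l argument.
RESOURCES="arch=${ARCH},mem_free=${MEM},s_vmem=${MEM}"

# Build the command and print it for review instead of running it.
CMD="qsub -l ${RESOURCES} -pe ${PE} ${SLOTS} my_job.sh"
echo "$CMD"
```

Printing the command first is a deliberate choice here: a request that no host can satisfy will leave the job waiting in the queue, so it is worth checking the values against qhost output before submitting.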

SGE host status can be seen with qhost.

Job queue status can be seen with qstat -f, which shows just your jobs. To see everyone's jobs, use qstat -f -u '*'. Note that qstat behaves differently than in previous versions of SGE.
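The status commands above can be collected into a short script. This is only a sketch: the helper run_or_show is a hypothetical name introduced here, and it runs a command when it is installed or prints it otherwise, so the script is safe to try on a machine without SGE.

```shell
#!/bin/sh
# run_or_show is a hypothetical helper: run the command if it is installed,
# otherwise print what would have been run.
run_or_show() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "not installed; would run: $*"
    fi
}

run_or_show qhost              # host status
run_or_show qstat -f           # your own jobs
run_or_show qstat -f -u '*'    # everyone's jobs
```

The quotes around '*' matter: without them the shell would expand the asterisk into the file names in the current directory before qstat sees it.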

For more detailed usage and examples, please see the SGE Documentation Site: SGE 6.2u5 documentation

Difference between revisions of "BX:SGE"
