Difference between revisions of "BX:SGE"

From CCGB
Jump to: navigation, search
(Complexes)
 
(9 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
= Overview =
 
= Overview =
 +
SGE (Sun Grid Engine) is a sophisticated batch queuing system responsible for accepting, scheduling, dispatching, and managing the remote and distributed execution of large numbers of standalone, parallel or interactive user jobs. It also manages and schedules the allocation of distributed resources such as processors, memory, and disk space. It can take some time to learn to use if you're used to a straight-forward queue system where you enter commands, one after the other, in a terminal window to execute them. It has many advantages over a simple queue, though, which you will see as you learn to use it.
  
 
There is one central SGE installation which handles job scheduling across all of the BX clusters, work servers, and workstations (with the exception of the okinawa, linne, and galaxy clusters). Merging existing clusters, work servers, and workstations is still a work-in-progress project.  
 
There is one central SGE installation which handles job scheduling across all of the BX clusters, work servers, and workstations (with the exception of the okinawa, linne, and galaxy clusters). Merging existing clusters, work servers, and workstations is still a work-in-progress project.  
Line 9: Line 10:
 
= Status =
 
= Status =
  
* Current ''BX Grid'' load can be seen through GANGLIA at http://ganglia.bx.psu.edu
+
* Current and past ''BX Grid'' load (metrics including cpu usage, memory usage and network usage) can be seen through GANGLIA at http://ganglia.bx.psu.edu  
* A web version of qstat (XSL formatted version of ''qstat -f -u '*' -xml) is available at http://qstat.bx.psu.edu
+
* The status of the current queue can be examined by running ''`qstat -f -u '*'`''. A web version of qstat (XSL formatted version of ''qstat -f -u '*' -xml) is available at http://qstat.bx.psu.edu.
 +
 
 +
= Scheduler =
 +
 
 +
The scheduling algorithms in SGE are quite complex, taking many factors into configuration. It is also quite complex to configure from an administrator standpoint, but we think we've found a good compromise in the current configuration to ensure everyone gets their fair share of the system.
 +
 
 +
Without going into too much detail, all you need to know is that we're using a fair-share configuration. Job priority for jobs in the qw state (signalling that the job is waiting for resources to be available) is based on your current usage of the cluster, and the usage of all the users on the system. If you're using more of the cluster than someone else, and you have 100 jobs waiting to run, and one other user only has 1 job waiting to run, their job will be scheduled before your other jobs. The length of time that a job has been in the queue is also considered.
 +
 
 +
Cluster usage is calculated by CPU, memory, and IO. Currently CPU accounts for 80%, memory is 10%, and IO is 10%.
 +
 
 +
Scheduling within your own jobs is FIFO (First in, First out). However, you can specify priority within your own scheduled jobs with '''-p N''' where -100<N<0. The default priority for jobs is -100 (normal users can't specify a priory > 0).
  
 
= Usage =
 
= Usage =
Line 16: Line 27:
 
== Job submission ==
 
== Job submission ==
  
To submit a job, put the command(s) into a script, and use '''qsub'''.  
+
To submit a job, put the command(s) into a script, and use '''qsub''' to submit the script.  
  
Various job resource requirements can be specified with '''-l resource=foo'''. Careful consideration should be given to your job's resource requirements. Specifying ''arch'', ''mem_free'', ''s_vmem'', and ''slots'' for threaded jobs is essential to ensure your job does not over-subscribe on resources, and runs to completion. SGE cannot predict what resources your job requires, for example it cannot predict how much memory your job will require, so it might schedule it on a node that has far less memory than necessary, causing the node to wedge itself or the job to die.
+
Various job resource requirements can be specified with '''-l resource=foo''' (See section on Complexes). Careful consideration should be given to your job's resource requirements. Specifying resources such as ''arch'', ''mem_free'', ''s_vmem'', and ''slots'' for threaded jobs is essential to ensure your job does not over-subscribe on resources, and runs to completion. SGE cannot predict what resources your job requires, for example it cannot predict how much memory your job will require, so it might schedule it on a node that has far less memory than necessary, causing the node to wedge itself or the job to die.
  
 
For more detailed usage and examples, please see the SGE Documentation Site:
 
For more detailed usage and examples, please see the SGE Documentation Site:
 
[http://wikis.sun.com/display/gridengine62u5/Using SGE 6.2u5 documentation]
 
[http://wikis.sun.com/display/gridengine62u5/Using SGE 6.2u5 documentation]
 +
 +
== Interactive Jobs ==
 +
Instead of ssh'ing to a particular node, you should use '''qlogin'''. qlogin takes all of the same options as qsub, except you don't supply a job script. When your qlogin job starts, you will be put into a shell on the node that meets the requirements specified by your qlogin command. From there, you may run whatever commands you want.
  
 
== Job monitoring ==
 
== Job monitoring ==
  
SGE host status can be seen with '''qhost'''
+
SGE host status can be seen with '''qhost'''. That will give you information about the architecture, number of CPU's, available memory and the current load for all the nodes in the clusters.
  
 
Job queue/status can be seen with '''qstat -f''', which will show just your jobs. To see everyone's jobs, '''qstat -f -u '*''''. Note that qstat behaves different than previous versions of SGE, where the new default is to show only your jobs.
 
Job queue/status can be seen with '''qstat -f''', which will show just your jobs. To see everyone's jobs, '''qstat -f -u '*''''. Note that qstat behaves different than previous versions of SGE, where the new default is to show only your jobs.
Line 45: Line 59:
  
 
== Complexes ==
 
== Complexes ==
At job submission time, you should specify the job requirements. The better you can describe the requirements, the better Grid Engine will be able to find a host or series of hosts to run your job. If you don't do this, your job will more than likely run out of memory, not have enough CPU, not have enough scratch space, run on a node that has a slow network connection, etc.
+
See ''complex(5)'' and ''sge_types(1)'' for full explanations.
 +
 
 +
At job submission time, you should specify the job requirements. The better you can describe the requirements, the better Grid Engine will be able to find a host or series of hosts to run your job. If you don't do this, your job will more than likely run out of memory, not have enough CPU, not have enough scratch space, run on a node that has a slow network connection, etc.  
 +
 
 +
Complexes can be specified in a combination of command line arguments and submission script comments.
 +
In a qsub script:
 +
<pre>#$ -l arch=lx24-amd64
 +
#$ -l mem_tokens=1G
 +
#$ -l mem_free=1G
 +
 
 +
some_command
 +
</pre>
 +
 
 +
As command arguments:
 +
<pre>qsub -l arch=lx24-amd64 -l mem_tokens=1G -l mem_free=1G myscript.qsub</pre>
 +
 
  
 
Some of the more common complexes that are used:
 
Some of the more common complexes that are used:
Line 51: Line 80:
 
! complex !! type !! consumable !! description !! example
 
! complex !! type !! consumable !! description !! example
 
|-
 
|-
| arch || string || no || Specifies CPU architecture and OS || lx24-amd64
+
| arch || regex string || no || Specifies CPU architecture and OS || lx24-amd64
 
|-
 
|-
 
| mem_tokens || memory || yes || Think of this as a queue slot, but for memory. Should be used along with ''mem_free'' for best results. || 2G
 
| mem_tokens || memory || yes || Think of this as a queue slot, but for memory. Should be used along with ''mem_free'' for best results. || 2G
Line 61: Line 90:
 
| scratch_free || memory || no || Amount of free space in /scratch. Updated every 5 minutes. Only observed at job start time || 50G
 
| scratch_free || memory || no || Amount of free space in /scratch. Updated every 5 minutes. Only observed at job start time || 50G
 
|}
 
|}
 +
 +
Memory values:
 +
* lowercase suffixes denote powers of 1000 (k, m, g).
 +
* uppercase suffixes denote powers of 1024 (K, M, G).
 +
 +
Use '''qhost -F [-h <hostname>]''' to see current complex values for a particular host. Example output:
 +
<pre>$ qhost -F -h townsend
 +
HOSTNAME                ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
 +
-------------------------------------------------------------------------------
 +
global                  -              -    -      -      -      -      -
 +
townsend.bx.psu.edu    lx24-amd64      2  0.00    3.7G    2.4G    1.0G    1.0G
 +
  hl:arch=lx24-amd64
 +
  hl:num_proc=2.000000
 +
  hl:mem_total=3.739G
 +
  hl:swap_total=1.004G
 +
  hl:virtual_total=4.742G
 +
  hl:load_avg=0.000000
 +
  hl:load_short=0.000000
 +
  hl:load_medium=0.000000
 +
  hl:load_long=0.000000
 +
  hl:mem_free=1.366G
 +
  hl:swap_free=3.199M
 +
  hl:virtual_free=1.369G
 +
  hl:mem_used=2.372G
 +
  hl:swap_used=1.000G
 +
  hl:virtual_used=3.373G
 +
  hl:cpu=0.100000
 +
  hl:m_topology=SCC
 +
  hl:m_topology_inuse=SCC
 +
  hl:m_socket=1.000000
 +
  hl:m_core=2.000000
 +
  hl:np_load_avg=0.000000
 +
  hl:np_load_short=0.000000
 +
  hl:np_load_medium=0.000000
 +
  hl:np_load_long=0.000000
 +
  hl:netspeed=95.367M
 +
  hl:scratch_free=884.541G
 +
  hl:scratch_total=884.548G
 +
  hl:scratch_used=7.233M
 +
  hc:mem_tokens=4.000G</pre>
 +
 +
Note: complex reporting is always done in powers of 1024. In this case, netspeed is 95.367M, or 95.367*2^(2*10). To get this back to decimal bits from binary bits, we simply multipley 95.367 by 1024*1024, then divide by 1000*1000 to get 99.99m, or 100m, so we see that this machine is on a 100mbit connection.
  
 
== Disk space ==
 
== Disk space ==
Line 75: Line 146:
  
 
You should not use your AFS home directory for any high throughput IO. The AFS home directory servers are fast, but are certainly not fast enough to serve HPC needs (with the exception of the dedicated bare-hardware AFS bulk storage servers). Running programs, scripts, reading small files, writing small output files should be fine, though. AFS does tend to break down when there are multiple writers to the same volume/directory, so keep this in mind. In most cases, the various NFS storage systems should be preferred to AFS.
 
You should not use your AFS home directory for any high throughput IO. The AFS home directory servers are fast, but are certainly not fast enough to serve HPC needs (with the exception of the dedicated bare-hardware AFS bulk storage servers). Running programs, scripts, reading small files, writing small output files should be fine, though. AFS does tend to break down when there are multiple writers to the same volume/directory, so keep this in mind. In most cases, the various NFS storage systems should be preferred to AFS.
 +
 +
<acl>ratan a</acl>

Latest revision as of 15:15, 6 May 2011

Overview

SGE (Sun Grid Engine) is a sophisticated batch queuing system responsible for accepting, scheduling, dispatching, and managing the remote and distributed execution of large numbers of standalone, parallel or interactive user jobs. It also manages and schedules the allocation of distributed resources such as processors, memory, and disk space. It can take some time to learn to use if you're used to a straight-forward queue system where you enter commands, one after the other, in a terminal window to execute them. It has many advantages over a simple queue, though, which you will see as you learn to use it.

There is one central SGE installation which handles job scheduling across all of the BX clusters, work servers, and workstations (with the exception of the okinawa, linne, and galaxy clusters). Merging existing clusters, work servers, and workstations is still a work-in-progress project.

The central grid engine has a pair of fully redundant master servers to ensure continuous job scheduling. The loss of both sge masters does not kill jobs that are currently running or queued, but will prevent any further job submissions. There is an approximately 5 minute failover period between sge master failure and the startup of the other sge master.

SGE can be used from any BX-maintained machine. Use `which qsub` to determine if you have access to the BX SGE grid. If not, try sourcing /afs/bx.psu.edu/service/sge/prod/default/common/settings.sh if you're using bash, or /afs/bx.psu.edu/service/sge/prod/default/common/settings.csh if you're using tcsh. If you see a message saying your machine is not an administrative host or submission host, send an email to admin-at-bx.psu.edu with the name of the host to have it added.

Status

  • Current and past BX Grid load (metrics including cpu usage, memory usage and network usage) can be seen through GANGLIA at http://ganglia.bx.psu.edu
  • The status of the current queue can be examined by running `qstat -f -u '*'`. A web version of qstat (XSL formatted version of qstat -f -u '*' -xml) is available at http://qstat.bx.psu.edu.

Scheduler

The scheduling algorithms in SGE are quite complex, taking many factors into configuration. It is also quite complex to configure from an administrator standpoint, but we think we've found a good compromise in the current configuration to ensure everyone gets their fair share of the system.

Without going into too much detail, all you need to know is that we're using a fair-share configuration. Job priority for jobs in the qw state (signalling that the job is waiting for resources to be available) is based on your current usage of the cluster, and the usage of all the users on the system. If you're using more of the cluster than someone else, and you have 100 jobs waiting to run, and one other user only has 1 job waiting to run, their job will be scheduled before your other jobs. The length of time that a job has been in the queue is also considered.

Cluster usage is calculated by CPU, memory, and IO. Currently CPU accounts for 80%, memory is 10%, and IO is 10%.

Scheduling within your own jobs is FIFO (First in, First out). However, you can specify priority within your own scheduled jobs with -p N where -100<N<0. The default priority for jobs is -100 (normal users can't specify a priory > 0).

Usage

Job submission

To submit a job, put the command(s) into a script, and use qsub to submit the script.

Various job resource requirements can be specified with -l resource=foo (See section on Complexes). Careful consideration should be given to your job's resource requirements. Specifying resources such as arch, mem_free, s_vmem, and slots for threaded jobs is essential to ensure your job does not over-subscribe on resources, and runs to completion. SGE cannot predict what resources your job requires, for example it cannot predict how much memory your job will require, so it might schedule it on a node that has far less memory than necessary, causing the node to wedge itself or the job to die.

For more detailed usage and examples, please see the SGE Documentation Site: SGE 6.2u5 documentation

Interactive Jobs

Instead of ssh'ing to a particular node, you should use qlogin. qlogin takes all of the same options as qsub, except you don't supply a job script. When your qlogin job starts, you will be put into a shell on the node that meets the requirements specified by your qlogin command. From there, you may run whatever commands you want.

Job monitoring

SGE host status can be seen with qhost. That will give you information about the architecture, number of CPU's, available memory and the current load for all the nodes in the clusters.

Job queue/status can be seen with qstat -f, which will show just your jobs. To see everyone's jobs, qstat -f -u '*'. Note that qstat behaves different than previous versions of SGE, where the new default is to show only your jobs.

It is highly recommended that job notification be turned on depending on how many jobs you will be submitting. This can be done with the SGE variable settings as seen below:

#$ -M youremail@bx.psu.edu
#$ -m beas

This will send email when your job starts, finishes, is suspended, or aborted or rescheduled.

Queue permissions

Anyone can submit jobs to all.q. However, permission is restricted at the host and hostgroup level. There are user groups that control who's jobs can run on particular groups of hosts. There are also groups to further restrict access to individual hosts within a hostgroup.

Currently, the following host groups are defined, along with the user group that can access them:

  • Persephone cluster (c1-14) : persephone_users
  • Zhang Lab (townsend) : zhang_lab
  • c14 : c14_users

Complexes

See complex(5) and sge_types(1) for full explanations.

At job submission time, you should specify the job requirements. The better you can describe the requirements, the better Grid Engine will be able to find a host or series of hosts to run your job. If you don't do this, your job will more than likely run out of memory, not have enough CPU, not have enough scratch space, run on a node that has a slow network connection, etc.

Complexes can be specified in a combination of command line arguments and submission script comments. In a qsub script:

#$ -l arch=lx24-amd64
#$ -l mem_tokens=1G
#$ -l mem_free=1G

some_command

As command arguments:

qsub -l arch=lx24-amd64 -l mem_tokens=1G -l mem_free=1G myscript.qsub


Some of the more common complexes that are used:

complex type consumable description example
arch regex string no Specifies CPU architecture and OS lx24-amd64
mem_tokens memory yes Think of this as a queue slot, but for memory. Should be used along with mem_free for best results. 2G
mem_free memory no Specifies amount of free memory on a node. Should be used along with mem_tokens for best results. 2G
netspeed memory no Specifies speed of primary network interface 1g, 100m
scratch_free memory no Amount of free space in /scratch. Updated every 5 minutes. Only observed at job start time 50G

Memory values:

  • lowercase suffixes denote powers of 1000 (k, m, g).
  • uppercase suffixes denote powers of 1024 (K, M, G).

Use qhost -F [-h <hostname>] to see current complex values for a particular host. Example output:

$ qhost -F -h townsend
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
townsend.bx.psu.edu     lx24-amd64      2  0.00    3.7G    2.4G    1.0G    1.0G
   hl:arch=lx24-amd64
   hl:num_proc=2.000000
   hl:mem_total=3.739G
   hl:swap_total=1.004G
   hl:virtual_total=4.742G
   hl:load_avg=0.000000
   hl:load_short=0.000000
   hl:load_medium=0.000000
   hl:load_long=0.000000
   hl:mem_free=1.366G
   hl:swap_free=3.199M
   hl:virtual_free=1.369G
   hl:mem_used=2.372G
   hl:swap_used=1.000G
   hl:virtual_used=3.373G
   hl:cpu=0.100000
   hl:m_topology=SCC
   hl:m_topology_inuse=SCC
   hl:m_socket=1.000000
   hl:m_core=2.000000
   hl:np_load_avg=0.000000
   hl:np_load_short=0.000000
   hl:np_load_medium=0.000000
   hl:np_load_long=0.000000
   hl:netspeed=95.367M
   hl:scratch_free=884.541G
   hl:scratch_total=884.548G
   hl:scratch_used=7.233M
   hc:mem_tokens=4.000G

Note: complex reporting is always done in powers of 1024. In this case, netspeed is 95.367M, or 95.367*2^(2*10). To get this back to decimal bits from binary bits, we simply multipley 95.367 by 1024*1024, then divide by 1000*1000 to get 99.99m, or 100m, so we see that this machine is on a 100mbit connection.

Disk space

Most if not all nodes have a /scratch directory which should be used in lieu of /tmp or /var/tmp for temporary job output.

You can specify -l scratch_free=xG where x is the amount of free space in /scratch.

AFS considerations

Without going into too much technical detail, when you submit a job, SGE 'steals' your Kerberos tickets and sends them along with the job, and uses them to obtain AFS tokens right before the job starts. Your Kerberos tickets, and hence your AFS tokens, have a default lifetime of 14 days, renewable up to 30 days. It is essential that you pay attention to your current ticket expiration date before submitting a long running job, or a job that will sit in the queue for a significant period of time. Ticket expiration time can be seen in the output of klist.

If you submit a job without currently posessing valid Kerberos tickets for your user, then your job will not have authenticated access to AFS, and hence your home directory.

You should not use your AFS home directory for any high throughput IO. The AFS home directory servers are fast, but are certainly not fast enough to serve HPC needs (with the exception of the dedicated bare-hardware AFS bulk storage servers). Running programs, scripts, reading small files, writing small output files should be fine, though. AFS does tend to break down when there are multiple writers to the same volume/directory, so keep this in mind. In most cases, the various NFS storage systems should be preferred to AFS.

'"`UNIQ--acl-00000004-QINU`"'