Difference between revisions of "BX:Slurm Cluster"

From CCGB
m (Changed "not working yet" from bold to italic)
(Added things to explain: slots and checkpoints)
 
(One intermediate revision by the same user not shown)

Latest revision as of 21:00, 27 October 2014


Introduction

The new BX cluster currently consists of 16 identical nodes (called nn0 - nn15), each with 64 cores, about 500 GB of available RAM, and about 500 GB of local disk space for temporary files, all connected via gigabit Ethernet. These are located in the ITS co-location facility, along with several large disk arrays for longer-term data storage. There are also some older nodes (n0 - n18), which are located in one of the server rooms in the MSC building.

The cluster is intended primarily for batch processing of computationally intensive analysis pipelines. While it is possible to access the nodes interactively, the cluster is generally not the best place to compose manuscripts, develop programs, or even to view data plots.

The operating system on the nodes is Debian Linux, and the job scheduler is called SLURM (Simple Linux Utility for Resource Management). Slurm plays a similar role to that of PBS or SGE on other clusters you may have used.

Note that this cluster doesn't have its own distinctive name yet; suggestions are welcome. In the meantime, you may hear it referred to as "the new BX cluster", "the Slurm cluster", "the Scofield cluster" (scofield is actually a separate machine that is used to access it), or "the Galaxy cluster" (since this hardware was previously used to run the main Galaxy server).


Accounts

Currently the cluster resides on a separate network (called "galaxyproject.org" for historical reasons); thus you need a separate account to access it. To get one, you must be sponsored by one of the BX faculty.

Also, to use it effectively you'll need to be fluent with Unix/Linux terminal commands (cd, ls, cp, rm, ...), a text editor (e.g. vi, emacs, or nano), remote login via ssh, and scp or rsync to transfer files to and from this account. We're not going to give you a quiz, but if you don't know these things, you should learn them first.

Once you're ready, make your request via email to admin-at-bx.psu.edu. Include your PSU userid (abc123) and BX account name if you have those, and be sure to cc: your sponsor.


Cost

For the time being, there is no charge to use this cluster. It is a shared resource, supported by various contributions from the BMB and Biology departments, Anton Nekrutenko, other BX faculty, and the Huck Institutes.

However, we will be tracking both CPU usage and disk space for each account, so if there are complaints about anyone hogging the system, that user's sponsor will be answerable to the other sponsors.


Logging in

To begin, ssh to scofield.bx.psu.edu. This machine is not technically part of the cluster, but it serves as an access point for using it, since the individual nodes have private IP addresses and are only accessible from within the galaxyproject.org network. You'll be submitting your cluster jobs from scofield using slurm commands.

With this account you have two home directories; these are separate from each other and from your BX home directory. One is your home directory on scofield: it is located in an AFS filesystem at /afs/galaxyproject.org/user/<yourname>, and can be used for routine stuff like notes, documents, or whatever. The other is your home directory on the actual cluster, which will be used by your jobs while they are running. It is located in a ZFS filesystem at /home/<yourname>, but can also be accessed from scofield as /galaxy/home/<yourname>. In contrast, note that the nodes cannot access your scofield home directory in AFS, so the cluster home directory is more useful.

Each time you log in to scofield, your initial working directory will be the AFS one (which you can check with the pwd command). To go to your cluster home directory instead, run the command cd /galaxy/home/<yourname>.
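
Because the same cluster home directory appears under two different paths, it can help to translate between them. The sketch below converts a path as seen on the nodes into the path for the same file as seen from scofield. The function name is hypothetical; only the /home and /galaxy/home prefixes come from the description above.

```shell
#!/bin/bash
# Hypothetical helper: translate a node-side path under /home into the
# equivalent path as seen from scofield (under /galaxy/home).
node_to_scofield_path() {
    case "$1" in
        /home/*) printf '%s\n' "/galaxy$1" ;;
        *)       printf '%s\n' "$1" ;;   # other paths are returned unchanged
    esac
}

# Example: a results file written by a job on a node (abc123 is a made-up user)
node_to_scofield_path /home/abc123/results/out.txt
```

Paths outside /home are passed through untouched, which is only a guess at the right answer; check them by hand.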


Using Slurm

There are several different ways to run jobs on the cluster, but the one we generally recommend is to put the commands you want to run in a text file called a script, and then submit the script to slurm using the sbatch command. You can also include slurm settings in the script, such as specifying how much memory your job will need, instructing it not to start until some other job has successfully completed, etc. By saving these small scripts you'll have a record of exactly how you ran each job, and convenient templates that you can just copy and tweak for future runs, without having to remember all the syntax.

Slurm is fairly well documented at LLNL. In addition to the reference man pages for each command, there's a Quick Start User Guide with examples, plus more in-depth guides on various aspects of the system. Here we present just a brief introduction to the basics of running jobs in batch mode.

The commands you're most likely to use are:

sbatch <script>
    submits a job script for execution
srun <options> <program> <arguments>
    used within a script to run a job step
sinfo -Nel
    shows status of nodes
squeue -l
    lists running and pending jobs
scontrol show job <id#>
    shows detailed info about a job's settings and allocations
scontrol show config
    reports the default parameters and how slurm is configured
scancel <id#>
    cancels a running or pending job


Useful sbatch / srun options:

--ntasks
    informs slurm about the number of tasks your job will perform
--nodes
    requests the number of nodes that your job will need
--constraint
    requests that the nodes have a particular feature, e.g. new or old
--mem
    requests the amount of memory that your job will need (not working yet)
--tmp
    requests the amount of temporary disk space that your job will need
--time
    requests the amount of run time that your job will need
--exclusive
    requests exclusive use of the assigned nodes
--dependency
    specifies that your job should not start until another job has successfully completed
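
As an illustration of --dependency, the sketch below submits two scripts so that the second starts only after the first has completed successfully, using the afterok dependency type. It relies on sbatch's usual "Submitted batch job <id#>" reply. The function names and the SBATCH_CMD override are hypothetical, added only so the logic can be tried without a scheduler.

```shell
#!/bin/bash
# Sketch: chain two job scripts with --dependency=afterok.
# SBATCH_CMD is a hypothetical override; it defaults to the real sbatch.
SBATCH_CMD="${SBATCH_CMD:-sbatch}"

# sbatch replies "Submitted batch job <id#>"; print just the id.
extract_job_id() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Usage: submit_chain first.sh second.sh
# Prints the job id of the second (dependent) job.
submit_chain() {
    local first_id
    first_id=$($SBATCH_CMD "$1" | extract_job_id)
    [ -n "$first_id" ] || { echo "submission of $1 failed" >&2; return 1; }
    # afterok: start only if the named job finished with exit code zero
    $SBATCH_CMD --dependency=afterok:"$first_id" "$2" | extract_job_id
}
```

On scofield this would be called as submit_chain align.sh summarize.sh, where both script names are placeholders.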


Notes:

  • Use #SBATCH lines within the script to specify options, such as the resources your job will need.
  • Your .bash_profile is generally not used automatically, so you should set all paths and other environment variables explicitly in the script.
  • Keep in mind that paths on the cluster nodes are not necessarily the same as their counterparts on scofield.
  • By default, output is directed to a file named slurm-<id#>.out in the directory from which you ran sbatch.
  • The default allocations for unspecified resources can be found by entering the command scontrol show config.
  • Currently this cluster is not configured to treat memory as a job resource, but we'll be changing that.
  • There is no site-wide maximum time limit for jobs. If your job exceeds the limit you requested, it will be signaled and then killed; on the other hand, if you request far more time than you need, the job may start later because of lowered priority.
  • The node features you can request with --constraint are new and old.
  • The slurm docs talk about "partitions"; currently we have just one, called general.
  • For non-trivially parallel computing, you'll need some MPI (Message Passing Interface) software. Currently we have OpenMPI installed, but others can be added if needed.


{things to explain}

  • jobs vs. tasks vs. steps
  • nodes vs. processors vs. cpus vs. sockets/cores/threads vs. "slots"
  • what does "the current node" mean, e.g. in source for sbcast?
    (Use path relative to nodes.)


{simple template script with suitable parameters for our site}
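Until a site-approved template is available, the following is a minimal sketch of what one might look like. It writes a small job script to a file named example_job.sh (a hypothetical name). The partition and feature names come from the notes above; the job name, the two-hour time limit, the 10240 MB of temporary space, and the PATH setting are placeholder assumptions to be adjusted. The --mem option is left out because it is not working yet.

```shell
#!/bin/bash
# Sketch: write a minimal job script to example_job.sh (hypothetical name).
cat > example_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --constraint=new
#SBATCH --time=02:00:00
#SBATCH --tmp=10240

# .bash_profile is not read automatically, so set the environment here.
# This PATH is an assumption; list the directories your programs live in.
export PATH=/usr/local/bin:/usr/bin:/bin

# Run one job step; replace hostname with your own program and arguments.
srun hostname
EOF

# To submit it from scofield:
#   sbatch example_job.sh
```

Keeping the #SBATCH lines together at the top, before any commands, matters: sbatch stops looking for them once it reaches the first command in the script.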



{example session using the template}
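As a rough sketch of a typical session, the function below submits a script, checks the queue, shows the job's details, and reports where the output should appear. It uses only the commands listed earlier on this page. The function names are hypothetical, and the script passed in is whatever job script you have prepared.

```shell
#!/bin/bash
# Sketch of a submit-and-check session, meant to be run on scofield.

# Name of the default output file for a given job id.
default_output_file() {
    printf 'slurm-%s.out\n' "$1"
}

# Usage: run_session my_job.sh
run_session() {
    local reply job_id
    reply=$(sbatch "$1") || return 1     # reply: "Submitted batch job <id#>"
    job_id=${reply##* }                  # keep the last word, i.e. the id
    squeue -l                            # is the job pending or running?
    scontrol show job "$job_id"          # its settings and allocations
    echo "output will appear in $(default_output_file "$job_id")"
}
```

If the job needs to be stopped, scancel with the same id will cancel it.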



{which nodes to run jobs on}
It's best to specify your needs and let slurm choose the nodes.


{how to pick suitable job parameters (cores, memory, time, etc.), and how they affect scheduling}


{where to place input and output files to improve performance}

  • /space vs. scratch4
  • sbcast slurm command


{consider adding checkpoints and/or saving intermediate results, so that if a job is killed or hits an error, you won't have to start it all over from the beginning}
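One simple way to do this in a job script is sketched below: process the inputs one at a time and leave a marker file for each one that finishes, so that a resubmitted job skips the work already done. The function names, the .txt/.out/.done file naming, and the stand-in process_one command are all hypothetical.

```shell
#!/bin/bash
# Sketch: per-input checkpointing with marker files.

# Hypothetical stand-in for the real analysis command.
process_one() {
    tr 'a-z' 'A-Z' < "$1" > "$2"
}

# Usage: run_with_checkpoints <input_dir> <output_dir>
run_with_checkpoints() {
    local in_dir="$1" out_dir="$2" input name marker
    mkdir -p "$out_dir"
    for input in "$in_dir"/*.txt; do
        [ -e "$input" ] || continue          # no matching inputs
        name=$(basename "$input" .txt)
        marker="$out_dir/$name.done"
        if [ -e "$marker" ]; then
            echo "skipping $name (already done)"
            continue
        fi
        # Write under a temporary name first, so a killed job never leaves
        # a half-written file that looks complete.
        process_one "$input" "$out_dir/$name.out.partial" &&
            mv "$out_dir/$name.out.partial" "$out_dir/$name.out" &&
            touch "$marker"
    done
}
```

The marker is created only after the output has been renamed into place, so an interrupted step is simply redone on the next run.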


{use of "fat nodes" for memory-intensive analysis}
(Postpone this section until the SGI sub-cluster is ready.)


{converting job scripts from PBS and other systems to slurm}
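As a starting point, the sketch below rewrites three common PBS directives as their usual Slurm counterparts: -N becomes --job-name, -l walltime= becomes --time, and -q becomes --partition. The function name is hypothetical, and everything else in the script (other resource requests, PBS environment variables such as PBS_O_WORKDIR, and qsub commands) passes through unchanged and still has to be converted by hand.

```shell
#!/bin/bash
# Sketch: translate a few PBS directives to #SBATCH lines.
# Reads a job script on standard input, writes the result to standard output.
pbs_to_slurm() {
    sed -E \
        -e 's/^#PBS[[:space:]]+-N[[:space:]]+(.+)$/#SBATCH --job-name=\1/' \
        -e 's/^#PBS[[:space:]]+-l[[:space:]]+walltime=(.+)$/#SBATCH --time=\1/' \
        -e 's/^#PBS[[:space:]]+-q[[:space:]]+(.+)$/#SBATCH --partition=\1/'
}

# Usage (file names are placeholders):
#   pbs_to_slurm < old_job.pbs > new_job.sh
```

Queue names rarely carry over between clusters, so check the resulting --partition line; the only partition here is general.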


Available software

Software currently available on the cluster includes (or may soon include):

{put detailed list on a separate wiki page}

  • OpenMPI
  • standard Unix tools and utilities
  • file transfer and management: scp, rsync, wget, curl, Mercurial, Git, RCS
  • compilers/interpreters for C, C++, Python, Perl, Java, R
  • virtualenv for Python {get instructions from ticket #1925}
  • common Python and Perl modules: NumPy, CGI, libperl4-corelibs-perl, Biopython, BioPerl, SciPy, Pysam, ...
  • GNU Scientific Library
  • database software: PostgreSQL, MySQL, ReadDB
  • RepeatMasker
  • alignment software: BLAST+, LASTZ, Multiz, BLAT, ...
  • NGS: FastQC, SAMtools, TopHat2, Bowtie, Cufflinks, BamStats, genome assemblers, MiSeq stuff, ...
  • peak callers: MACS
  • UCSC genomics utilities: liftOver, wigToBigWig, bedGraphToBigWig, ...
  • ChromHMM
  • most of the programs behind the Galaxy tools
  • additional software requested by users: PASA, SmrtAnalysis, TGICL, ...
  • {stuff Rebeca uses: Genometools, Trinity, GMAP, Trinotate, SRA toolkit, ESTScan, SSPACE, ...}
  • {Monika's list}


{command to generate a complete, up-to-date list as needed}
We don't currently have such a thing, but look in the following places:

  • /galaxy/software/linux-x86_64/bin
  • /nfs/brubeck.bx.psu.edu/scratch4/software


{how to get more software}
installing in your home directory vs. requesting central install


Unlike BioStar and some other clusters, we don't currently have a module system for specifying the software versions to be used by a particular job. We may add that as a future enhancement, but in the meantime Python's virtualenv feature can provide a similar capability (though only for Python programs).


{GUIs and visualizing results: forward X11 connection, or run elsewhere?}


Data storage

In addition to your cluster home directory /home/<yourname>, both scofield and the nodes can also access files in the storage areas scratch1 – scratch4, via paths like /nfs/brubeck.bx.psu.edu/scratch4/.... These are the same areas accessible from brubeck and some other BX machines, which should help reduce your need to copy large files. They have no per-user disk quotas yet, but that may change in the future. Please be courteous and delete large datasets that you no longer need.

Note that at present, these areas are not backed up, as the available backup services are prohibitively expensive for such vast quantities of data. PSU is investigating solutions for backing up research data, but for now we strongly recommend that you copy the essentials you can't live without to an external drive or other location on a regular basis. The Colo should be much more reliable than 509 Wartik, but problems like hardware failure can still cause data loss.

While jobs are running, they may need to create temporary files to store intermediate results. Each node has about 500GB of local disk space for this purpose, which can be accessed at /space/<yourname>. Being local to the node means that I/O is significantly faster, but keep in mind that this space is shared by all of the jobs running on that node, and files must be deleted when the job finishes. {can we make this happen automatically?}
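
Until cleanup is automated, a job script can tidy up after itself. The sketch below creates a private directory for the job under the local space and removes it when the work finishes, using a shell trap. The function names and the SCRATCH_BASE override are hypothetical; /space/$USER assumes your directory there is named after your login. SLURM_JOB_ID is set by slurm inside a running job.

```shell
#!/bin/bash
# Sketch: per-job scratch directory that is removed on exit.

# Create and print a scratch directory for this job.
make_job_scratch() {
    local base dir
    base="${SCRATCH_BASE:-/space/$USER}"
    # Outside a job SLURM_JOB_ID is unset, so fall back to the shell's PID.
    dir="$base/job_${SLURM_JOB_ID:-$$}"
    mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Usage: run_in_scratch <command> [arguments...]
# Runs the command inside the scratch directory, then deletes the directory.
run_in_scratch() {
    (
        scratch=$(make_job_scratch) || exit 1
        trap 'rm -rf "$scratch"' EXIT      # runs when this subshell exits
        cd "$scratch" && "$@"
    )
}
```

Copy any results you want to keep to your cluster home directory or a scratch area before the command returns. The trap is not a guarantee: a job that is killed outright may still leave files behind.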

{explain how to transfer data between systems with scp and rsync}


Alternative resources

{pros/cons vs. BioStar, CyberStar, LionX, Galaxy, cloud, etc.}


For admins

{ansible vs. ad-hoc local installation of software}

{extending a job's time limit}
scontrol update JobId=<id#> TimeLimit=[+]<days>-<hours>
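
To avoid typos when extending a limit, the command can be assembled and reviewed before it is run. The sketch below only prints the command in the form shown above; the function name and the example values are hypothetical.

```shell
#!/bin/bash
# Sketch: print (not run) the scontrol command that adds time to a job.
# Usage: extend_cmd <job id> <days> <hours>
extend_cmd() {
    local n
    for n in "$1" "$2" "$3"; do
        case "$n" in
            ''|*[!0-9]*) echo "expected three whole numbers" >&2; return 1 ;;
        esac
    done
    printf 'scontrol update JobId=%s TimeLimit=+%s-%s\n' "$1" "$2" "$3"
}

# Add 1 day and 12 hours to job 4242
extend_cmd 4242 1 12
```

This prints scontrol update JobId=4242 TimeLimit=+1-12, which can then be pasted into a shell with the necessary privileges.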

{running usage reports, by user and PI}