This section presents a few examples of job submission under the CERIT-SC infrastructure.
First, let us summarize several fundamental principles and suggestions:
First, check whether the application you require is available in the CERIT-SC/MetaCentrum infrastructure (see the list of available applications referenced in the Available Applications section)
Decide where your job will read its input data from, and where it will save its working and output data
Submit the job via the Torque or PBS Pro batch system (using the qsub command); do not forget to specify your job's requirements -- the number of execution nodes, the amount of memory, as well as any application licences.
Most probably, the job will wait for available nodes (or an application licence) after submission; the reason why the job is waiting can be found using the following command, run on any of the available frontends (see the comment section in the command's output):
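$ qstat -f <jobID>    # look for the "comment = ..." line in the output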
The common waiting reasons are:
- there are not enough nodes fitting the required properties; the job is waiting for another job to finish and release the requested resources
- the scheduling server's input queue limits the number of simultaneously running jobs per user; if you have submitted more jobs than the limit allows, the waiting job will run once another of your jobs finishes.
Running batch jobs under the CERIT-SC infrastructure
1. Access
Log on to the CERIT-SC frontend (see the Infrastructure access section):
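For example (assuming zuphux.cerit-sc.cz as the CERIT-SC frontend hostname):
$ ssh <username>@zuphux.cerit-sc.cz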
2. Prepare the job's input data
Copy the job's input data to your home (sub)directory (/storage/.../home/<username>/), available on the frontend.
More information about storage.
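For example, the input files used in the script below could be copied from your local machine like this (the frontend hostname and the target directory, which mirrors the DATADIR used later, are illustrative):
$ scp app.py input.txt <username>@zuphux.cerit-sc.cz:/storage/brno3-cerit/home/<username>/example/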
3. Prepare the job's startup script
I. Use your favourite text editor to create a shell script (named, for example, script.sh) that initializes the module system and the application you want to use (see the Applications section):
#!/bin/bash
#PBS -l select=1:ncpus=2:mem=4gb:scratch_local=10gb
#PBS -l walltime=01:30:00
#PBS -N example
# initialize the required application (e.g. Python, version 3.4.1, compiled by gcc)
module add python-3.4.1-gcc
II. Instruct your script to transfer the job's input data to a subdirectory of the fast /scratch storage volume (before the computation), and subsequently to transfer the job's output data back and clean the working directory (after the computation).
Add the following lines at the end of the above script:
a) Shared storage (home via NFS)
DATADIR="/storage/brno3-cerit/home/$LOGNAME/example"
# clean the SCRATCH when job finishes (and data
# are successfully copied out) or is killed
trap 'clean_scratch' TERM EXIT
cp $DATADIR/app.py $DATADIR/input.txt $SCRATCHDIR
cd $SCRATCHDIR
# ... the computation ...
python app.py input.txt
# copy resources from scratch directory back on disk
# field, if not successful, scratch is not deleted
cp output.txt $DATADIR || export CLEAN_SCRATCH=false
b) Network connected storage (without NFS)
DATADIR="storage-brno3-cerit.metacentrum.cz:~/example"
# clean the SCRATCH when job finishes (and data
# are successfully copied out) or is killed
trap 'clean_scratch' TERM EXIT
scp $DATADIR/app.py $DATADIR/input.txt $SCRATCHDIR
# use "scp -R ..." in case of copying directories
cd $SCRATCHDIR
# ... the computation ...
python app.py input.txt
# copy resources from scratch directory back on disk
scp output.txt $DATADIR || export CLEAN_SCRATCH=false
# use "scp -R ..." in case of copying directories
4. Submit the (batch) job
Specify the job's resource requirements (number of execution nodes, requested processors, memory, scratch size and type, application licenses, etc.) and the maximum job run-time (walltime).
These resource requirements can be specified:
- inside the script file
- as command-line arguments to qsub utility (see below)
- in both places – in such a case, the command-line arguments have higher priority and override the values written inside the script
Qsub example:
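For instance, the script prepared above could be submitted like this (a minimal sketch; the resource requests repeat the #PBS directives already present in the script, and command-line values would override them):
$ qsub -l select=1:ncpus=2:mem=4gb:scratch_local=10gb -l walltime=01:30:00 script.sh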
The new select syntax offers more possibilities – chunk usage, submitting requests to a specific machine, working with cgroups, requesting the presence/absence of a specific machine feature, etc.
Qsub modifications: please note that, for better readability, the following examples are not complete – memory, scratch, walltime, etc. may not be specified!
- Two chunks, one with 1 processor and the second with 2:
  qsub -l select=1:ncpus=1+1:ncpus=2:mem= ...
- Request for the specific node tarkil3.metacentrum.cz:
  qsub -l select=1:ncpus=1:vnode=tarkil3:mem=1gb ...
- Request for a node with or without the feature "cl_tarkil":
  qsub -l select=1:ncpus=1:cl_tarkil=True ...
  qsub -l select=1:ncpus=1:cl_tarkil=False ...
- Request for a machine with cgroups:
  qsub -l select=1:ncpus=1:cgroups=True ...
- Request for 2 chunks on exclusive node(s):
  qsub -l select=2:ncpus=1 -l place=excl ...
- All chunks must be on one node:
  qsub -l select=2:ncpus=1 -l place=pack ...
- Scratch usage <scratch_local|scratch_ssd|scratch_shared> (there is no default type of scratch!):
  qsub -l select=1:ncpus=1:mem=4gb:scratch_ssd=1gb ...
- Interactive job:
  qsub -I -l select=1:ncpus=4:mem=1gb ...
- GPU computing (GPU card IDs are in the CUDA_VISIBLE_DEVICES variable):
  qsub -l select=1:ncpus=1:ngpus=2 -q gpu ...
5. Monitoring the job's state:
- qstat -u <login> # lists all of the user's running or waiting jobs
- qstat -xu <login> # lists all of the user's jobs (including finished ones)
- qstat <jobID> # outputs basic information about the job
- qstat -f <jobID> # outputs more detailed information about a running or waiting job
- qstat -xf <jobID> # outputs more detailed information about a running, waiting, or finished job
- qdel <jobID> # kills the job (if necessary)
See the qstat man page for a description of all options.
Status of the job (most common):
Q ... queued
R ... running
E ... exiting
F ... finished
6. Killing the job (if necessary)
If you need to forcibly terminate/kill any of your jobs (no matter whether it is already running or still waiting), use the qdel command as follows:
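$ qdel <jobID>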
7. Getting the standard output (stdout) and standard error output (stderr) of your job:
- <job_name>.o<jobID> ... standard output
- <job_name>.e<jobID> ... standard error output
Exploring these files can show you the job's results, or can help you trace the reasons for a failed run.
Both outputs can be merged into a single file by adding the qsub parameter -j oe:
- <job_name>.o<jobID> ... merged standard and error output
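For example (both forms are equivalent; the resource requests are illustrative):
$ qsub -j oe -l select=1:ncpus=2:mem=4gb script.sh
or, as a directive inside the script:
#PBS -j oe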
8. Submitting batch jobs dedicated to CERIT-SC resources through the MetaCentrum frontends
- When submitting the job using the qsub command, do not forget to explicitly specify the CERIT-SC scheduling server:
qsub -q default@wagap-pro.cerit-sc.cz
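A complete submission from a MetaCentrum frontend might then look like this (the resource requests are illustrative):
$ qsub -q default@wagap-pro.cerit-sc.cz -l select=1:ncpus=2:mem=4gb -l walltime=01:30:00 script.sh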
- Similarly, to list all your recently submitted jobs, explicitly specify the scheduling server which the qstat command should query:
$ qstat -u <username> @wagap-pro.cerit-sc.cz
Useful links
Qsub assembler: https://metavo.metacentrum.cz/pbsmon2/person
PBSMon, a web application for monitoring jobs: https://metavo.metacentrum.cz/pbsmon2/
Documentation: https://wiki.metacentrum.cz/wiki/