This section presents a few examples of job submission under the CERIT-SC infrastructure.
First, let us summarize several fundamental principles and suggestions:
First, check whether the application you require is available in the CERIT-SC/MetaCentrum infrastructure (see the list of available applications referenced in the Available Applications section)
Decide where your job will read its input data from, and where it will save its working and output data
Submit the job via the Torque or PBS Pro batch system (using the qsub command); do not forget to specify your job's requirements -- the number of execution nodes, the amount of memory, as well as any application licences.
Most probably, the job will wait for available nodes (or an application licence) after submission; the reason why the job is waiting can be found using the following command, run on any of the available frontends (see the comment section in the command's output):
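$ qstat -f <jobID>    # look for the "comment = ..." line in the output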
The common waiting reasons are:
- there are not enough nodes fitting the required properties; the job is waiting for another job to finish and release the requested resources
- the scheduling server's input queue limits the number of simultaneously running jobs per user; if you have submitted more jobs than the limit allows, the waiting job will run once another of your jobs finishes.
Running batch jobs under the CERIT-SC infrastructure
1. Access
Log on to the CERIT-SC frontend (see the Infrastructure access section):
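For example (assuming zuphux.cerit-sc.cz as the CERIT-SC frontend hostname):
$ ssh <username>@zuphux.cerit-sc.cz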
2. Prepare the job's input data
Copy the job's input data to your home (sub)directory (/storage/.../home/<username>/), available on the frontend.
More information about storage.
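For example, the input files used in the script below could be copied from your local machine like this (the frontend hostname and the target directory, which mirrors the DATADIR used later, are illustrative):
$ scp app.py input.txt <username>@zuphux.cerit-sc.cz:/storage/brno3-cerit/home/<username>/example/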
3. Prepare the job's startup script
I. Use your favourite text editor to create a shell script (named, for example, script.sh) that initializes the module system and the application you want to use (see the Applications section):
#!/bin/bash
#PBS -l select=1:ncpus=2:mem=4gb:scratch_local=10gb
#PBS -l walltime=01:30:00
#PBS -N example
# initialize the required application (e.g. Python, version 3.4.1, compiled by gcc)
module add python-3.4.1-gcc
II. Instruct your script to transfer the job's input data to a subdirectory of the fast /scratch storage volume (before the computation), and subsequently to transfer the job's output data back and clean the working directory (after the computation).
Add the following lines at the end of the above script:
a) Shared storage (home via NFS)
DATADIR="/storage/brno3-cerit/home/$LOGNAME/example"
# clean the SCRATCH when job finishes (and data
# are successfully copied out) or is killed
trap 'clean_scratch' TERM EXIT
cp $DATADIR/app.py $DATADIR/input.txt $SCRATCHDIR
cd $SCRATCHDIR
# ... the computation ...
python app.py input.txt
# copy resources from scratch directory back on disk
# field, if not successful, scratch is not deleted
cp output.txt $DATADIR || export CLEAN_SCRATCH=false
b) Network connected storage (without NFS)
DATADIR="storage-brno3-cerit.metacentrum.cz:~/example"
# clean the SCRATCH when job finishes (and data
# are successfully copied out) or is killed
trap 'clean_scratch' TERM EXIT
scp $DATADIR/app.py $DATADIR/input.txt $SCRATCHDIR
# use "scp -R ..." in case of copying directories
cd $SCRATCHDIR
# ... the computation ...
python app.py input.txt
# copy resources from scratch directory back on disk
scp output.txt $DATADIR || export CLEAN_SCRATCH=false
# use "scp -R ..." in case of copying directories
4. Submit the (batch) job
Specify the job's resource requirements (number of execution nodes, requested processors, memory, scratch size and type, application licenses, etc.) and the maximum job run-time (walltime).
These resource requirements can be specified:
- inside the script file
- as command-line arguments to qsub utility (see below)
- in both places – in such a case, the command-line arguments have higher priority and override the values written inside the script
Qsub example:
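For instance, the script prepared above could be submitted like this (a minimal sketch; the resource requests repeat the #PBS directives already present in the script, and command-line values would override them):
$ qsub -l select=1:ncpus=2:mem=4gb:scratch_local=10gb -l walltime=01:30:00 script.sh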
The new select syntax offers more possibilities – chunk usage, submitting requests to a specific machine, working with cgroups, requesting the presence/absence of a specific machine feature, etc.
Qsub modifications: please note that, for better readability, the following examples are not complete – memory, scratch, walltime, etc. may not be specified!
- Two chunks, one with 1 processor and the second with 2:
  qsub -l select=1:ncpus=1+1:ncpus=2:mem= ...
- Request for the specific node tarkil3.metacentrum.cz:
  qsub -l select=1:ncpus=1:vnode=tarkil3:mem=1gb ...
- Request for a node with or without the feature "cl_tarkil":
  qsub -l select=1:ncpus=1:cl_tarkil=True ...
  qsub -l select=1:ncpus=1:cl_tarkil=False ...
- Request for a machine with cgroups:
  qsub -l select=1:ncpus=1:cgroups=True ...
- Request for 2 chunks on exclusive node(s):
  qsub -l select=2:ncpus=1 -l place=excl ...
- All chunks must be on one node:
  qsub -l select=2:ncpus=1 -l place=pack ...
- Scratch usage <scratch_local|scratch_ssd|scratch_shared> (there is no default type of scratch!):
  qsub -l select=1:ncpus=1:mem=4gb:scratch_ssd=1gb ...
- Interactive job:
  qsub -I -l select=1:ncpus=4:mem=1gb ...
- GPU computing (GPU card IDs are in the CUDA_VISIBLE_DEVICES variable):
  qsub -l select=1:ncpus=1:ngpus=2 -q gpu ...
5. Monitoring the job's state:
- qstat -u <login> # lists all of the user's running or waiting jobs
- qstat -xu <login> # lists all of the user's jobs (including finished ones)
- qstat <jobID> # outputs basic information about the job
- qstat -f <jobID> # outputs more detailed information about a running or waiting job
- qstat -xf <jobID> # outputs more detailed information about a running, waiting, or finished job
- qdel <jobID> # kills the job (if necessary)
See the qstat man page for a description of all options.
Status of the job (most common):
Q ... queued
R ... running
E ... exiting
F ... finished
6. Killing the job (if necessary)
If you need to forcibly terminate/kill any of your jobs (no matter whether it is already running or still waiting), use the qdel command as follows:
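$ qdel <jobID>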
7. Getting the standard output (stdout) and standard error output (stderr) of your job:
- <job_name>.o<jobID> ... standard output
- <job_name>.e<jobID> ... standard error output
Exploring these files can show you the job's results, or can help you trace the reasons for a failed run.
Both outputs can be merged into a single file by adding the qsub parameter -j oe:
- <job_name>.o<jobID> ... merged standard and error output
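For example (both forms are equivalent; the resource requests are illustrative):
$ qsub -j oe -l select=1:ncpus=2:mem=4gb script.sh
or, as a directive inside the script:
#PBS -j oe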
8. Submitting batch jobs dedicated to CERIT-SC resources through the MetaCentrum frontends
- When submitting the job using the qsub command, do not forget to explicitly specify the CERIT-SC scheduling server:
qsub -q default@wagap-pro.cerit-sc.cz
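A complete submission from a MetaCentrum frontend might then look like this (the resource requests are illustrative):
$ qsub -q default@wagap-pro.cerit-sc.cz -l select=1:ncpus=2:mem=4gb -l walltime=01:30:00 script.sh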
- Similarly, to list all your recently submitted jobs, explicitly specify the scheduling server which the qstat command should query:
$ qstat -u <username> @wagap-pro.cerit-sc.cz
Useful links
Qsub assembler: https://metavo.metacentrum.cz/pbsmon2/person
PBSMon, a web application for monitoring jobs: https://metavo.metacentrum.cz/pbsmon2/
Documentation: https://wiki.metacentrum.cz/wiki/