CERIT-SC

Examples

This section presents a few examples of submitting jobs on the CERIT-SC infrastructure.

First, let us summarize several fundamental principles and suggestions:

  1. first, check whether the application you require is available on the CERIT-SC infrastructure (see the list of available applications in the Applications section)
  2. decide where your job will read its input data from and where it will save its working and output data (see the list of available storage volumes in the Storage section)
  3. submit the job via the Torque submission system (using the qsub command); do not forget to specify your job's requirements: the number of execution nodes, the amount of memory, and any application licences (see the details in the Nodes'/Jobs' property specification section)
  4. most probably, the job will wait for available nodes (or an application licence) after submission; to find out why a job is waiting, run the following command on any of the available frontends and look at the comment field in its output:
    $ qstat -f <jobid>
    where <jobid> is your job's identifier (e.g., 12345.wagap.cerit-sc.cz)

    The common waiting reasons are:

    • Not Running: Not enough nodes fitting the nodespec found - there are not enough nodes matching the required properties; the job is waiting for another job to finish and release the requested resources.
    • Not Running: User has reached queue running job limit - the scheduling server's input queue limits the number of simultaneously running jobs per user; if you have submitted more jobs than the limit allows, the waiting job will start once another of your jobs finishes.
    • Not Running: Draining system to allow starving job to run - a multiprocessor job is waiting in the input queue; the system holds back other (less processor-demanding) jobs so that they do not overtake the multiprocessor job.
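
The waiting reason can also be extracted from the qstat output by a script. The following is a minimal sketch, assuming that qstat -f prints the reason on a single line of the form "comment = ..."; the function name job_wait_reason is hypothetical, and a reason wrapped over several lines would be cut to its first line.

```shell
# Print the value of the "comment" attribute found in `qstat -f` output.
# The output is read from standard input, so the function itself never
# calls qstat. Assumption: the attribute is printed as "comment = <text>".
job_wait_reason() {
    sed -n 's/^[[:space:]]*comment = //p'
}
```

On a frontend, it would be used as: qstat -f <jobid> | job_wait_reason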

Suggestions for handling files (storage volumes):

  1. /home ... use this volume for storing your data (home directory).
  2. /scratch ... if your job works with large files, store them in the relevant subdirectory of the /scratch volume (on the machine where your job actually runs!). This volume is provided by high-performance local disks, so the delay of I/O operations is minimized.
  3. /storage ... use this volume for data that you want to make directly available from the MetaCentrum nodes, as well as for data that you want to keep permanently and have backed up (see the section Storage).

 

Running batch jobs on the CERIT-SC infrastructure

1. Log in to the frontend:

Log in to the CERIT-SC frontend (see the section Infrastructure access):

$ ssh <username>@zuphux.cerit-sc.cz

Note: Use a similar procedure to log in with alternative clients (e.g., PuTTY).

2. Prepare the job's input data:

Copy the job's input data to your home (sub)directory (/home/<username>/), available on the frontend.

3. Prepare the job's startup script:

I. Use your favourite text editor to create a shell script (named, for example, myjob.sh) that initializes the modular system and then the application you want to use (see the section Applications):

#!/bin/bash
# initialize the modular system:
. /packages/run/modules-2.0/init/sh
# initialize the required application
# (in this case, the parallel Amber version 10)
module add amber10-parallel

II. Instruct your script to transfer the job's input data to a subdirectory of the fast /scratch storage volume (before the computation), and subsequently, to transfer the job's output data and to clean the working directory (after the computation).

Add the following lines at the end of the above script:

LOGNAME=$(whoami) # programs run via pbsdsh do not have a complete environment

DATADIR="/storage/brno1/home/$LOGNAME" # shared via NFSv4
WORKSPACE="/scratch/$LOGNAME/$PBS_JOBID" # local disk

# stage in: create the working directory and copy the input file (vstup.txt)
mkdir -p "$WORKSPACE" || exit 1
cp "$DATADIR/vstup.txt" "$WORKSPACE/" || exit 1
cd "$WORKSPACE" || exit 2

# ... the computation ...

# stage out: copy the output file (vystup.txt) back to the shared volume
if ! cp "$WORKSPACE/vystup.txt" "$DATADIR/"; then
    echo "Copying the output data failed. Copy them manually from $(hostname)." 1>&2
    exit 3
fi

# clean up the working directory on the local disk
if ! rm -rf "$WORKSPACE"; then
    echo "Cleanup failed. Please remove the data manually from $(hostname)." 1>&2
    exit 4
fi
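
The stage-in, computation, and stage-out phases can also be wrapped in a single function so that the same logic is reusable and testable outside a job. This is only a sketch under stated assumptions: the function name run_in_scratch and the exit code 5 (failed computation) are hypothetical, while the file names vstup.txt and vystup.txt and the exit codes 1 to 4 follow the script above. The workspace is deliberately kept when anything fails, so that no data is lost.

```shell
# run_in_scratch DATADIR WORKSPACE COMMAND [ARGS...]
# Copies DATADIR/vstup.txt into WORKSPACE, runs COMMAND there, copies
# vystup.txt back to DATADIR, and removes WORKSPACE only on success.
# The body is a subshell ( ... ), so `cd` and `exit` do not affect the caller.
run_in_scratch() (
    datadir=$1
    workspace=$2
    shift 2

    mkdir -p "$workspace"                  || exit 1
    cp "$datadir/vstup.txt" "$workspace/"  || exit 1
    cd "$workspace"                        || exit 2

    # the computation; exit code 5 is a hypothetical addition
    "$@"                                   || exit 5

    cp vystup.txt "$datadir/"              || exit 3
    # leave the workspace before deleting it
    cd / && rm -rf "$workspace"            || exit 4
)
```

Inside a job script, the call could look like: run_in_scratch "$DATADIR" "$WORKSPACE" <command>, where <command> stands for the actual computation.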

Notes:

a) If your job does not need any other applications, it is not necessary to initialize the modular system.

b) If your job does not perform intensive I/O operations, copying the input/output data to the /scratch volume is not necessary either (the job may keep its input/output data in the /home volume).

4. Submit the (batch) job:

Specify the job's requirements on execution nodes (number of nodes, available processors, memory, application licences, etc.) and the maximum job run-time (see details in the section Nodes'/Jobs' property specification).

Via the qsub command, pass these parameters to the scheduling system (the command will output an identifier of the submitted job (jobID)):

$ qsub -l walltime=<timespec> -l nodes=<nodesnum>:ppn=<procpernode> myjob.sh
12345.wagap.cerit-sc.cz   # = jobID

Example: To submit a job lasting at most 10 hours and requesting 2 execution nodes (on each of them, 10 processors and 4 GB of memory), use the following command (the job will be submitted via the CERIT-SC frontend):

$ qsub -l walltime=10:00:00 -l nodes=2:ppn=10,mem=4gb myjob.sh
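
When jobs with different requirements are submitted repeatedly, the command line can be assembled by a small helper. The following is a minimal sketch; the function name build_qsub is hypothetical, and the function only prints the command (it does not run qsub), so the result can be checked before it is used.

```shell
# build_qsub WALLTIME NODES PPN MEM SCRIPT
# Prints a qsub command line in the form shown in the example above.
build_qsub() {
    if [ "$#" -ne 5 ]; then
        echo "usage: build_qsub WALLTIME NODES PPN MEM SCRIPT" 1>&2
        return 1
    fi
    printf 'qsub -l walltime=%s -l nodes=%s:ppn=%s,mem=%s %s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# The example above: 10 hours, 2 nodes, 10 processors per node, 4 GB of memory
build_qsub 10:00:00 2 10 4gb myjob.sh
# prints: qsub -l walltime=10:00:00 -l nodes=2:ppn=10,mem=4gb myjob.sh
```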

Notes:

a) Instead of being passed to the qsub command, the resource requirements can also be specified inside the job's startup script: add the following lines immediately after the script's first line (after #!/bin/bash):

#PBS -l walltime=<timespec>
#PBS -l nodes=<nodesnum>:ppn=<procpernode>:...

The subsequent job submission is performed using the following command:

$ qsub myjob.sh

b) If you need help assembling the qsub parameters, use MetaCentrum's qsub assembly helper utility.
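
As an illustration of note a), the sketch below writes a startup script with the requirements embedded as #PBS directives. The function name write_job_script is hypothetical; the values (10 hours, 2 nodes, 10 processors per node) are taken from the example above, and the module lines repeat the earlier Amber example.

```shell
# write_job_script PATH
# Writes an example startup script to PATH. The #PBS directives come
# directly after the #!/bin/bash line, before any other command.
write_job_script() {
    cat > "$1" <<'EOF'
#!/bin/bash
#PBS -l walltime=10:00:00
#PBS -l nodes=2:ppn=10
# initialize the modular system:
. /packages/run/modules-2.0/init/sh
# initialize the required application
module add amber10-parallel
# ... the computation ...
EOF
}
```

After running write_job_script myjob.sh, the job is submitted with a plain "qsub myjob.sh".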

5. Monitoring the job's state:

To monitor the job's state, use the qstat command as follows:

$ qstat <jobID>   # outputs basic information about the job
$ qstat -f <jobID>   # outputs exhaustive information about the job

Among other things, the output reports the job's elapsed time and its current state, which takes one of the following values:

  • Q ... queued
  • R ... running
  • E ... exiting
  • C ... completed
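
The state can also be checked from a script, for example to wait until a job finishes. The following is a minimal sketch, assuming that qstat -f prints the state on a line of the form "job_state = <letter>"; all three function names are hypothetical, and the default polling interval of 60 seconds is an arbitrary choice.

```shell
# Translate a one-letter job state into the description listed above.
describe_job_state() {
    case "$1" in
        Q) echo "queued" ;;
        R) echo "running" ;;
        E) echo "exiting" ;;
        C) echo "completed" ;;
        *) echo "unknown state: $1" ;;
    esac
}

# Extract the state letter from `qstat -f` output read on standard input.
# Assumption: the attribute is printed as "job_state = <letter>".
job_state_from_qstat() {
    sed -n 's/^[[:space:]]*job_state = //p'
}

# wait_for_job JOBID [INTERVAL_SECONDS]
# Polls qstat until the job is completed (C) or no state can be read
# (for example, because the server no longer knows the job).
# Prints the last state seen, or "unknown".
wait_for_job() {
    wfj_jobid=$1
    wfj_interval=${2:-60}
    while :; do
        wfj_state=$(qstat -f "$wfj_jobid" 2>/dev/null | job_state_from_qstat)
        if [ "$wfj_state" = "C" ] || [ -z "$wfj_state" ]; then
            break
        fi
        sleep "$wfj_interval"
    done
    echo "${wfj_state:-unknown}"
}
```

A typical call on a frontend would be: describe_job_state "$(wait_for_job <jobID>)"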

Note: To list all your recent jobs (including basic information about them), use the following command:

$ qstat -u <username>

6. Killing the job (if necessary):

If you need to forcibly terminate (kill) any of your jobs, whether already running or still waiting, use the qdel command as follows:

$ qdel <jobID>

Note: After a forcible termination is requested, the job enters the E (exiting) state and subsequently the C (completed) state.

7. Getting the standard output (stdout) and standard error output (stderr) of your job:

Once a job terminates (either regularly or forcibly), the following files (representing its standard output and standard error output) are created in your home directory on the frontend through which you submitted the job:

  • <job_name>.o<jobID> ... standard output
  • <job_name>.e<jobID> ... standard error output

For example, the files "myjob.o12345" and "myjob.e12345". Examining these files shows you the job's results or helps you trace the cause of a failed run.
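
The expected file names can be derived from the job name and the jobID. The sketch below assumes, based on the example above, that only the numeric part of the jobID (the part before the first dot) appears in the file names; the function name job_output_files is hypothetical.

```shell
# job_output_files JOB_NAME JOBID
# Prints the names of the stdout and stderr files, one per line.
job_output_files() {
    jof_name=$1
    jof_seq=${2%%.*}    # "12345.wagap.cerit-sc.cz" becomes "12345"
    printf '%s.o%s\n%s.e%s\n' "$jof_name" "$jof_seq" "$jof_name" "$jof_seq"
}

job_output_files myjob 12345.wagap.cerit-sc.cz
# prints:
#   myjob.o12345
#   myjob.e12345
```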

Submitting batch jobs intended for CERIT-SC resources through the MetaCentrum frontends:
If you want to submit a CERIT-SC batch job through the MetaCentrum frontends, apply the following changes to the procedure above:
  • point 2. Copy the job's input data to a (sub)directory on the /storage volume (the CERIT-SC /home volume is not directly available from the MetaCentrum nodes)
  • point 3. Following from the previous point, make the corresponding changes in the part of the startup script that copies the input/output data to/from the /scratch volume
  • point 4. When submitting the job using the qsub command, do not forget to explicitly specify the CERIT-SC scheduling server (option "-q @wagap.cerit-sc.cz")
  • point 5. Similarly, to list all your recently submitted jobs, explicitly specify the scheduling server that the qstat command should query:
    $ qstat -u <username> wagap.cerit-sc.cz
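
The changes in points 4 and 5 can be sketched as two small helpers that print the adjusted commands. The function names and the CERIT_SERVER variable are hypothetical; the server name and the "-q @..." option are the ones given above. Both functions only print the command and do not contact the scheduling server.

```shell
CERIT_SERVER="wagap.cerit-sc.cz"   # scheduling server named in the text above

# cerit_qsub_cmd SCRIPT [QSUB_OPTIONS...]
# Prints a qsub command that targets the CERIT-SC scheduling server;
# further options (e.g. -l resource requests) are passed through.
cerit_qsub_cmd() {
    cqc_script=$1
    shift
    printf 'qsub -q @%s' "$CERIT_SERVER"
    for cqc_opt in "$@"; do
        printf ' %s' "$cqc_opt"
    done
    printf ' %s\n' "$cqc_script"
}

# cerit_qstat_cmd USERNAME
# Prints the qstat command that lists the user's jobs on that server.
cerit_qstat_cmd() {
    printf 'qstat -u %s %s\n' "$1" "$CERIT_SERVER"
}
```

For example, cerit_qsub_cmd myjob.sh -l walltime=10:00:00 prints a command that can be reviewed and then run on a MetaCentrum frontend.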



(c) 2011 CERIT - Center for Education, Research and Innovation in ICT in Brno
