Examples
This section presents a few examples of job submission in the CERIT-SC infrastructure.
First, let us summarize several fundamental principles and suggestions:
- first, check whether the application you require is available in the CERIT-SC infrastructure (see the list of available applications referenced in the Available Applications section)
- decide where your job will read its input data from and where it will save its working and output data (see the list of available storage volumes in the Storage section)
- submit the job via the Torque submission system (using the qsub command); do not forget to specify your job's requirements: the number of execution nodes, the available memory, and any application licences (see the details in the Nodes'/Jobs' property specification)
- most probably, the job will wait for available nodes (or an application licence) after submission; the reason for the waiting can be found using the following command, run on any of the available frontends (see the comment section in the command's output):
  $ qstat -f <jobid>
  where <jobid> is your job's identifier (e.g., 12345.wagap.cerit-sc.cz)
  The common waiting reasons are:
  - Not Running: Not enough nodes fitting the nodespec found - there are not enough nodes fitting the required properties; the job is waiting for another job to finish and release the requested resources.
  - Not Running: User has reached queue running job limit - the scheduling server's input queue limits the number of simultaneously running jobs per user; if you have submitted more jobs than the limit allows, the waiting job will run once another of your jobs finishes.
  - Not Running: Draining system to allow starving job to run - there is a multiprocessor job waiting in the input queue; the system holds back other (less processor-demanding) jobs so that they do not overtake the multiprocessor job.
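The lookup of the waiting reason described above can be scripted. The following is a minimal sketch, assuming that qstat -f prints the reason on a single line of the form "comment = <text>"; the function name waiting_reason is hypothetical, and long comments that qstat wraps onto continuation lines are not rejoined.

```shell
# Print the value of the "comment" attribute found in `qstat -f` output.
# The output is read from standard input, so the function can be tried
# on saved text without access to the scheduler.
# Assumption: the attribute sits on one line as "comment = <text>".
waiting_reason() {
  sed -n 's/^[[:space:]]*comment = //p'
}

# Usage on a frontend (requires Torque):
#   qstat -f <jobid> | waiting_reason
```

Reading from standard input keeps the parsing separate from the scheduler call, which makes it easy to check against a saved copy of the qstat output.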
Suggestions for handling files (storage volumes):
- /home ... use this volume for storing your data (home directory).
- /scratch ... if your job works with large files, store them in the relevant subdirectory of the /scratch volume (on the machine where your job actually runs!). This volume resides on high-performance local disks, which minimizes the latency of I/O operations.
- /storage ... use this volume for data that you want to make directly available from the MetaCentrum nodes, as well as for data that you want to keep permanently and backed up (see the section Storage).
Running batch jobs in the CERIT-SC infrastructure
| 1. Log on to the frontend: |
|---|
|
Log on to the CERIT-SC frontend (see the section Infrastructure access):
$ ssh <username>@zuphux.cerit-sc.cz |
|
Note: Use a similar procedure to log in using alternative clients (e.g., PuTTY). |
| 2. Prepare the job's input data: |
Copy the job's input data to your home (sub)directory (/home/<username>/), available on the frontend. |
| 3. Prepare the job's startup script: |
|
I. Use your favourite text editor to create a shell script (the qsub examples in step 4 call it myjob.sh) with the following content:
#!/bin/bash
# initialize the modular system:
. /packages/run/modules-2.0/init/sh
# initialize the required application
# (in this case, the parallel Amber version 10)
module add amber10-parallel |
|
II. Instruct your script to transfer the job's input data to a subdirectory of the fast /scratch volume (the WORKSPACE variable below). Add the following lines at the end of the above script:
LOGNAME=`whoami` # programs run via pbsdsh do not have a complete environment
DATADIR="/storage/brno1/home/$LOGNAME/" # shared via NFSv4
WORKSPACE="/scratch/$LOGNAME/$PBS_JOBID/" # local disk
mkdir $WORKSPACE
cp $DATADIR/vstup.txt $WORKSPACE || exit 1 # vstup.txt is the input file
cd $WORKSPACE || exit 2
# ... the computation ...
cp $WORKSPACE/vystup.txt $DATADIR # vystup.txt is the output file
if [ $? -ne 0 ]; then
  echo Copy output data failed. Copy them manually from `hostname` 1>&2
  exit 3
fi
rm -rf $WORKSPACE
if [ $? -ne 0 ]; then
  echo Cleanup failed. Please, remove data manually from `hostname` 1>&2
  exit 4
fi |
|
Notes: a) If your job does not need any other applications, it is not necessary to initialize the modular system. b) If your job does not perform intensive I/O operations, neither the copying of input/output data to the |
| 4. Submit the (batch) job: |
|
Specify the job's requirements on execution nodes (number of nodes, available processors, memory, application licences, etc.) and the maximum job run-time (see details in the section Nodes'/Jobs' property specification), and submit the job via the qsub command:
$ qsub -l walltime=<timespec> -l nodes=<nodesnum>:ppn=<procpernode> myjob.sh
12345.wagap.cerit-sc.cz # = jobID |
|
Example: To submit a job lasting at most 10 hours and requesting 2 execution nodes (on each of them, 10 processors and 4 GB of memory), use the following command (the job will be submitted via the CERIT-SC frontend):
$ qsub -l walltime=10:00:00 -l nodes=2:ppn=10,mem=4gb myjob.sh |
|
Notes: a) Instead of specifying the requirements via the qsub command line, you can put them into the startup script as #PBS lines:
#PBS -l walltime=<timespec>
#PBS -l nodes=<nodesnum>:ppn=<procpernode>:...
The subsequent job submission is performed using the following command:
$ qsub myjob.sh
b) If you need help with |
| 5. Monitoring the job's state: |
|
To monitor the job's state, use the qstat command:
$ qstat <jobID> # outputs basic information about the job
$ qstat -f <jobID> # outputs exhaustive information about the job
Among other things, the output informs you about the job's elapsed time and its current state, which takes one of the following values:
|
|
Note: The list of all your recent jobs (including basic information about them) can be displayed using the following command:
$ qstat -u <username> |
| 6. Killing the job: (if necessary) |
|
If you need to forcibly terminate/kill any of your jobs (no matter whether it is already running or just waiting), use the qdel command:
$ qdel <jobID> |
|
Note: Once a forcible termination is requested, the job enters the E (exiting) state and subsequently the C (completed) state. |
| 7. Getting the standard output (stdout) and standard error output (stderr) of your job: |
|
Once a job terminates (either regularly or forcibly), the following files (representing its standard output and standard error output) are created in your home directory on the frontend from which you submitted the job:
for example, the files " |
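The stage-in, compute, stage-out pattern of step 3 can also be written as a reusable function. This is a sketch under assumptions, not an official script of the infrastructure: the function name stage_and_run and its arguments are hypothetical, the file names vstup.txt (input) and vystup.txt (output) are taken from the script in step 3, and the #PBS lines reuse the resource values of the example in step 4.

```shell
#!/bin/bash
#PBS -l walltime=10:00:00
#PBS -l nodes=2:ppn=10,mem=4gb

# stage_and_run DATADIR WORKSPACE COMMAND [ARGS...]
# Copies DATADIR/vstup.txt into WORKSPACE, runs COMMAND inside WORKSPACE,
# copies WORKSPACE/vystup.txt back to DATADIR and removes WORKSPACE.
# Return codes: 1 = stage-in failed, 2 = computation failed,
# 3 = stage-out failed, 4 = cleanup failed.
stage_and_run() {
  sr_datadir=$1
  sr_workspace=$2
  shift 2
  mkdir -p "$sr_workspace" || return 1
  cp "$sr_datadir/vstup.txt" "$sr_workspace/" || return 1
  # Run the computation in a subshell so the caller's directory is unchanged.
  ( cd "$sr_workspace" && "$@" ) || return 2
  if ! cp "$sr_workspace/vystup.txt" "$sr_datadir/"; then
    echo "Copying output data failed; copy it manually from $(hostname)" 1>&2
    return 3
  fi
  if ! rm -rf "$sr_workspace"; then
    echo "Cleanup failed; remove the data manually from $(hostname)" 1>&2
    return 4
  fi
}

# Inside a real job (./compute is a hypothetical computation command):
# stage_and_run "/storage/brno1/home/$(whoami)" "/scratch/$(whoami)/$PBS_JOBID" ./compute
```

Whenever a step fails, the workspace is left in place, so the data can still be recovered by hand from the node the job ran on.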
- point 2. Copy the job's input data to a (sub)directory on the /storage volume (the CERIT-SC /home volume is not directly available from the MetaCentrum nodes)
- point 3. Based on the previous point, make the relevant changes in the part of the startup script that copies the input/output data to/from the /scratch volume
- point 4. When submitting the job using the qsub command, do not forget to explicitly specify the CERIT-SC scheduling server (option "-q @wagap.cerit-sc.cz")
- point 5. Similarly, to list all your recently submitted jobs, explicitly specify the scheduling server to be queried by the qstat command:
$ qstat -u <username> wagap.cerit-sc.cz
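Point 4 above asks for the scheduling server to be named explicitly at submission. As a small sketch, the helper below only prints the resulting qsub command line, so it can be inspected before it is run; the function name cerit_qsub_cmd is hypothetical, while the options (-q, -l walltime, -l nodes with ppn) are those used in this section.

```shell
# Print (do not run) a qsub command that targets the CERIT-SC scheduling server.
# Arguments: walltime, number of nodes, processors per node, script name.
cerit_qsub_cmd() {
  echo "qsub -q @wagap.cerit-sc.cz -l walltime=$1 -l nodes=$2:ppn=$3 $4"
}

# Example with the values used earlier in this section:
cerit_qsub_cmd 10:00:00 2 10 myjob.sh
```

Printing the command first is a deliberate choice: the line can be checked and then pasted on a frontend where Torque is available.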




