Jobs in gLite

Simple jobs

To execute a job on the Grid it is necessary to describe it in a JDL (Job Description Language) file. The JDL describes the job and its requirements.

Here is the simplest example of a JDL file, to run a simple job on the grid:

[morgan@grid003 ASTRO]$ cat hostname.jdl 
[ 
   Type="Job"; 
   JobType="Normal"; 
   Executable = "/bin/hostname"; 
   StdError = "hostname.err"; 
   StdOutput = "hostname.out"; 
   OutputSandbox = {"hostname.err", "hostname.out"}; 
   RetryCount = 1; 
] 

The Executable attribute specifies the executable that will run on the Worker Node. The StdOutput and StdError attributes identify the files to which the executable's standard output and standard error are redirected. The OutputSandbox attribute lists the files the user wants to copy back after job execution, normally the standard output and error plus any small files produced during the run. Note that there is a size limit on the OutputSandbox (the total size cannot exceed 30 MB). Finally, RetryCount specifies the number of resubmissions in case of failure.

JDL is described more fully in the JDL Attributes guide (see References).

To submit a job to the Grid it is necessary to delegate credentials to the WM Proxy server.

The user can either specify the delegationId to be associated with the delegated proxy by using the --delegationid option (-d for short):

glite-wms-job-delegate-proxy -d myfirstdelegationid

With the -d option the delegation is created and its identifier is remembered, so that subsequent invocations of glite-wms-job-submit and glite-wms-job-list-match can be given that delegation name, avoiding the delegation of a new proxy each time.

Instead of creating a named delegation, the -a option can be used, which causes a delegated proxy to be established automatically. With -a you don't need to run glite-wms-job-delegate-proxy first, but you have to pass -a to every invocation of glite-wms-job-submit and glite-wms-job-list-match. Heavy use of this option is not recommended, since it delegates a new proxy for each command issued, and delegation is a time-consuming operation; it is better to delegate once with glite-wms-job-delegate-proxy and reuse the identifier, as shown below.
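For example, a quick test that relies on automatic delegation looks like this (illustrative):

glite-wms-job-list-match -a hostname.jdl
glite-wms-job-submit -a hostname.jdl

In the rest of this page we create a named delegation once and reuse it: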

[morgan@grid003 ASTRO]$ glite-wms-job-delegate-proxy -d $USER

Connecting to the service https://grid005.oats.inaf.it:7443/glite_wms_wmproxy_server


================== glite-wms-job-delegate-proxy Success ==================

Your proxy has been successfully delegated to the WMProxy:
https://grid005.oats.inaf.it:7443/glite_wms_wmproxy_server

with the delegation identifier: morgan

==========================================================================

Before running the job, it is useful to test which Computing Elements (CEs) are able to accept it. Do this with the command glite-wms-job-list-match. The -d option lets you pass the delegation identifier you created above.

[morgan@grid003 ASTRO]$ glite-wms-job-list-match -d $USER hostname.jdl

Connecting to the service https://grid005.oats.inaf.it:7443/glite_wms_wmproxy_server

==========================================================================

                     COMPUTING ELEMENT IDs LIST
 The following CE(s) matching your job requirements have been found:

        *CEId*
 - ce-nano-37.to.infn.it:2119/jobmanager-lcgpbs-infinite
 - ce-nano-37.to.infn.it:2119/jobmanager-lcgpbs-long

To submit a job we use:


[morgan@grid003 ASTRO]$ glite-wms-job-submit -d $USER -o jobid hostname.jdl

Connecting to the service https://grid005.oats.inaf.it:7443/glite_wms_wmproxy_server


====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://grid005.oats.inaf.it:9000/rRRnnGoCyLRreXaySIQUIQ

The job identifier has been saved in the following file:
/home/morgan/jobid

==========================================================================

The file /home/morgan/jobid is the output of the submission process: it receives the jobID(s) returned by the WMS. If another job is submitted (by repeating the submission command), its jobID is appended to the same file. Try it yourself.
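For example, submitting twice with the same -o file (a sketch):

glite-wms-job-submit -d $USER -o jobid hostname.jdl
glite-wms-job-submit -d $USER -o jobid hostname.jdl

After the second submission the jobid file contains both https://... identifiers.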

In order to know about the job status another command is available:


[morgan@grid003 ASTRO]$ glite-wms-job-status -i jobid


*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://grid005.oats.inaf.it:9000/oNqZ-9WkDlNMOZBwwrWPMA
Current Status:     Done (Success)
Logged Reason(s):
    - 
    - Job terminated successfully
Exit code:          0
Status Reason:      Job terminated successfully
Destination:        grid010.ct.infn.it:2119/jobmanager-lcgpbs-short
Submitted:          Mon Jun 16 09:34:41 2008 CEST
*************************************************************

The -i option specifies the file from which the command takes the jobID(s) to be inspected. The job status is Done, so the job ended successfully from the point of view of the WMS. The Exit code is the exit code of your executable and can be used to verify that your code ran successfully on the WN.
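If you want to wait for completion from a script, a minimal polling loop works (a sketch, assuming the jobid file contains a single identifier):

#!/bin/bash
# poll the status every 60 seconds until the job reaches a final state
while ! glite-wms-job-status -i jobid | grep -qE 'Done|Aborted|Cancelled'; do
        sleep 60
done
glite-wms-job-status -i jobid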

Alternatively the same command can be issued directly specifying the jobID(s), as in the following case:

 [morgan@grid003 ASTRO]$ glite-wms-job-status https://grid005.oats.inaf.it:9000/rRRnnGoCyLRreXaySIQUIQ

*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://grid005.oats.inaf.it:9000/rRRnnGoCyLRreXaySIQUIQ
Current Status:     Done (Success)
Logged Reason(s):
    - 
    - Job terminated successfully
Exit code:          0
Status Reason:      Job terminated successfully
Destination:        grid010.ct.infn.it:2119/jobmanager-lcgpbs-short
Submitted:          Mon Jun 16 09:34:41 2008 CEST
*************************************************************

Note that this command doesn't require a delegation identifier to be specified.

Alternatively it is possible to select a specific WMS with the -e option:

[morgan@localhost TUTORIAL]$ glite-wms-job-submit -e  https://grid005.oats.inaf.it:7443/glite_wms_wmproxy_server -d $USER hostname.jdl 

Connecting to the service https://grid005.oats.inaf.it:7443/glite_wms_wmproxy_server


====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://grid005.oats.inaf.it:9000/RQXckvmshVPiJiVhznrSIQ

==========================================================================


[morgan@localhost TUTORIAL]$ glite-wms-job-status https://grid005.oats.inaf.it:9000/RQXckvmshVPiJiVhznrSIQ


*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://grid005.oats.inaf.it:9000/RQXckvmshVPiJiVhznrSIQ
Current Status:     Ready 
Status Reason:      unavailable
Destination:        gridce.sns.it:2119/jobmanager-lcgpbs-grid
Submitted:          Wed Aug 20 19:51:01 2008 CEST
*************************************************************

To get the output we use the command:

[morgan@localhost TUTORIAL]$ glite-wms-job-output --dir myjobs  https://grid005.oats.inaf.it:9000/RQXckvmshVPiJiVhznrSIQ

Connecting to the service https://grid005.oats.inaf.it:7443/glite_wms_wmproxy_server


================================================================================

                        JOB GET OUTPUT OUTCOME

Output sandbox files for the job:
https://grid005.oats.inaf.it:9000/RQXckvmshVPiJiVhznrSIQ
have been successfully retrieved and stored in the directory:
/home/morgan/TUTORIAL/myjobs

================================================================================


[morgan@localhost TUTORIAL]$ ls myjobs/
hostname.err  hostname.out
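The retrieved files can now be inspected; the standard output contains the name of the Worker Node that executed /bin/hostname (the value below is illustrative):

[morgan@localhost TUTORIAL]$ cat myjobs/hostname.out
wn-01.example.org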

More complex jobs

Normally a job submitted to the grid runs a binary compiled by the user, which may require some input files and/or command-line arguments.

For example, suppose you have a program called exec.x:

[morgan@localhost TUTORIAL]$ ./exec.x -h

Usage: exec.x -d -o 
           -d number of iterations
           -o outputfile [optional otherwise goes to stdout] 

We want to execute it on the Grid, so we prepare a JDL file:

[morgan@localhost TUTORIAL]$ cat exec.jdl
[
   Type      = "Job"; 
   JobType = "Normal"; 
   Executable = "exec.x"; 
   Arguments = "-d 10"; 
   StdOutput = "std.out";  
   StdError = "std.err"; 
   OutputSandbox = {"std.err","std.out"}; 
   InputSandbox = {"exec.x"}; 
   RetryCount = 3;
] 

With this JDL we can submit the job:

[morgan@localhost TUTORIAL]$ glite-wms-job-submit -e  https://grid005.oats.inaf.it:7443/glite_wms_wmproxy_server -d $USER exec.jdl 

Connecting to the service https://grid005.oats.inaf.it:7443/glite_wms_wmproxy_server


====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://grid005.oats.inaf.it:9000/RQXckvmshVPiJiVhznrSIQ

==========================================================================


[morgan@localhost TUTORIAL]$ glite-wms-job-status https://grid005.oats.inaf.it:9000/RQXckvmshVPiJiVhznrSIQ


*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://grid005.oats.inaf.it:9000/RQXckvmshVPiJiVhznrSIQ
Current Status:     Ready 
Status Reason:      unavailable
Destination:        gridce.sns.it:2119/jobmanager-lcgpbs-grid
Submitted:          Wed Aug 20 19:51:01 2008 CEST
*************************************************************

It is also possible to use the OutputSandbox to retrieve small files produced by the job, for example:

[morgan@localhost TUTORIAL]$ cat exec.jdl
[
   Type      = "Job"; 
   JobType = "Normal"; 
   Executable = "exec.x"; 
   Arguments = "-d 10 -o out.txt"; 
   StdOutput = "std.out";  
   StdError = "std.err"; 
   OutputSandbox = {"std.err","std.out", "out.txt"}; 
   InputSandbox = {"exec.x"}; 
   RetryCount = 3;
] 

In this case our exec.x code will also produce an output file called out.txt, which we retrieve through the OutputSandbox.
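After retrieving the sandbox, the extra file sits next to the standard output and error (a sketch; substitute your own jobID):

glite-wms-job-output --dir myexec <jobID>
cat myexec/out.txt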

Job that produces big output files

If our program produces a number of big files, it is not possible to use the OutputSandbox to retrieve them. It is necessary to use the Data Management service. Here we will see an example.

Suppose we have a program called exec.x that requires an input file to run and that produces some output files: output1.dat, output2.dat, output3.dat.

[morgan@localhost JOB_WITH_OUT]$ ./exec.x 
Usage: exec.x -i filename
       -i input file name

Output files:
                output1.dat
                output2.dat
                output3.dat
[morgan@localhost JOB_WITH_OUT]$ 

To run this program on the Grid it is necessary to create a shell (or perl/python...) script.

We prepare a script file to execute it on the Grid:

#!/bin/bash

# set global LFC variables on your service

export LCG_GFAL_INFOSYS=egee-bdii.cnaf.infn.it:2170  # define the BDII endpoint
export LCG_CATALOG_TYPE=lfc                # define the catalog type
export LFC_HOST=lfcserver.cnaf.infn.it         # define the catalog endpoint
EXIT_VALUE=0
export GRID_PATH=/grid/planck/morgan           # LFC directory where outputs are registered

help () {
echo "Usage: $0 -i inputfile -s storageelement"
exit 0
}

# parse the command line; -s takes an argument, hence "s:"
while getopts "hs:i:" flag
do
  if [ "$flag" == i ]; then
          inputfile=$OPTARG
  elif [ "$flag" == h ]; then
        help
  elif [ "$flag" == s ]; then
        storageelement=$OPTARG
  fi
done

# check if the WN has a valid default SE, otherwise use the user-supplied one
if [ "XXX${VO_PLANCK_DEFAULT_SE}" != "XXX" ]; then
        ping -c 2 ${VO_PLANCK_DEFAULT_SE}
        # fall back to the user-supplied SE if the default one does not respond
        if [ $? -ne 0 ]; then
                export VO_PLANCK_DEFAULT_SE=$storageelement
        fi
else
        export VO_PLANCK_DEFAULT_SE=$storageelement
fi

chmod 700 $PWD/exec.x
$PWD/exec.x -i $inputfile
if [ $? -eq 0 ]; then
        # register every non-empty output file (output1..output3) on the default SE via the LFC
        for ((i=1; i<=3; i++))
        do
                if [ -s output$i.dat ]; then
                        echo -n "Saving file output$i.dat on LFC  "
                        lcg-cr --vo planck -d ${VO_PLANCK_DEFAULT_SE} -l lfn:${GRID_PATH}/output${i}.dat file:$PWD/output${i}.dat
                        if [ $? -ne 0 ]; then
                                EXIT_VALUE=255
                        fi
                        echo done.
                fi
        done
fi

exit $EXIT_VALUE

We prepare a JDL file:

[morgan@localhost JOB_WITH_OUT]$ cat exec.jdl 
Type="Job"; 
JobType="Normal"; 
Executable = "exec.sh"; 
Arguments = "-i inputfile.dat -s gridse.ilc.cnr.it"; 
StdError = "stderr.log"; 
StdOutput = "stdout.log"; 
InputSandbox = {"exec.sh", "exec.c", "inputfile.dat"}; 
OutputSandbox = {"stderr.log", "stdout.log"}; 

Then we delegate the proxy and submit the job:

[morgan@localhost JOB_WITH_OUT]$ glite-wms-job-submit -d $USER exec.jdl 

Connecting to the service https://glite-rb-00.cnaf.infn.it:7443/glite_wms_wmproxy_server


====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://albalonga.cnaf.infn.it:9000/0T97OYwBbRNylhS1sEZVEA

==========================================================================

If the job ends successfully we will find the output files in the LFC catalogue.

[morgan@localhost JOB_WITH_OUT]$ lfc-ls /grid/planck/morgan/
output1.dat
output2.dat
output3.dat
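The registered files can then be copied back to the UI with lcg-cp, for example (a sketch):

lcg-cp --vo planck lfn:/grid/planck/morgan/output1.dat file:$PWD/output1.dat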

Job that requires input data

Suppose that the input data required by the previous application is too big to fit in the InputSandbox (>10 MB). In this case it is necessary to use the Data Management services of the grid to store the input file and then fetch it from the WN. To reduce the amount of network traffic it is possible to set up the JDL so that the job runs in the same grid site where the data are stored.

The DataRequirements attribute is a list of classads representing the data requirements of the job. Each classad has to contain three attributes:

  • InputData: the list of input data needed by the job;
  • DataCatalogType: the type of data catalog that has to be contacted in order to resolve logical names to physical names;
  • DataCatalog: the URI of the data catalog, if this is not the VO default one.

NOTE: the presence of the DataRequirements attribute causes the job to run on a Computing Element (CE) close to the Storage Element (SE) where the requested file is stored. The attribute does not perform the actual copy of the file from the SE to the WN; as we will see, this has to be done by the user.
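On the WN this copy is typically done with lcg-cp, resolving the logical file name through the catalog; this is the same call used in the script below:

lcg-cp --vo planck lfn:/grid/planck/morgan/inputfile.dat file:$PWD/inputfile.dat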

We save the input file in the catalogue

[morgan@localhost JOB_WITH_INPUT]$ lcg-cr --vo planck -l lfn:/grid/planck/morgan/inputfile.dat -d $DPNS_HOST file:/home/morgan/TUTORIAL/WMS/JOB_WITH_INPUT/inputfile.dat 
guid:da91d70e-7d14-4734-b149-6b1d680fdd82
[morgan@localhost JOB_WITH_INPUT]$ echo $LFC_HOST
grid004.oats.inaf.it
[morgan@localhost JOB_WITH_INPUT]$ lfc-ls /grid/planck/morgan
inputfile.dat

and the new exec.sh file will be:

#!/bin/bash

# set global LFC variables on your service

export LCG_GFAL_INFOSYS=egee-bdii.cnaf.infn.it:2170  # define the BDII endpoint
export LCG_CATALOG_TYPE=lfc                # define the catalog type
export LFC_HOST=grid004.oats.inaf.it         # define the catalog endpoint
EXIT_VALUE=0
export GRID_PATH=/grid/planck/morgan           # LFC directory where outputs are registered


help () {
echo "Usage: $0 -i inputfile -s storageelement"
exit 0
}


# parse the command line; -s takes an argument, hence "s:"
while getopts "hs:i:" flag
do
  if [ "$flag" == i ]; then
          inputfile=$OPTARG
  elif [ "$flag" == h ]; then
        help
  elif [ "$flag" == s ]; then
        storageelement=$OPTARG
  fi
done

# check if the WN has a valid default SE, otherwise use the user-supplied one
if [ "XXX${VO_PLANCK_DEFAULT_SE}" != "XXX" ]; then
        ping -c 2 ${VO_PLANCK_DEFAULT_SE}
        # fall back to the user-supplied SE if the default one does not respond
        if [ $? -ne 0 ]; then
                export VO_PLANCK_DEFAULT_SE=$storageelement
        fi
else
        export VO_PLANCK_DEFAULT_SE=$storageelement
fi


# download the input file from the Grid (the -i argument is an lfn: URL)
lcg-cp --vo planck $inputfile file:$PWD/inputfile.dat
if [ $? -ne 0 ]; then
        echo "inputfile download error"
        exit 250
fi

if [ ! -s $PWD/inputfile.dat ]; then
        echo "input file not found or empty"
        exit 250
fi


chmod 700 $PWD/exec.x
$PWD/exec.x -i $PWD/inputfile.dat
if [ $? -eq 0 ]; then
        # register every non-empty output file (output1..output3) on the default SE via the LFC
        for ((i=1; i<=3; i++))
        do
                if [ -s output$i.dat ]; then
                        echo -n "Saving file output$i.dat on LFC  "
                        lcg-cr --vo planck -d ${VO_PLANCK_DEFAULT_SE} -l lfn:${GRID_PATH}/output${i}.dat file:$PWD/output${i}.dat
                        if [ $? -ne 0 ]; then
                                EXIT_VALUE=255
                        fi
                        echo done.
                fi
        done
fi

exit $EXIT_VALUE

The JDL is:

[morgan@localhost JOB_WITH_INPUT]$ cat exec.jdl 
[
        Executable = "exec.sh";
        Arguments = "-i lfn:/grid/planck/morgan/inputfile.dat -s gridse.ilc.cnr.it";
        StdOutput = "std.out";
        StdError = "std.err";

        InputSandbox = {"exec.sh","exec.x"};
        OutputSandbox = {"std.out","std.err"};

        DataRequirements = {
                [
                  InputData = {"lfn:/grid/planck/morgan/inputfile.dat"};
                  DataCatalogType = "DLI";
                  DataCatalog = "http://grid004.oats.inaf.it:8085";
                ]
        };
        DataAccessProtocol = {"rfio","gsiftp"};
        
        RetryCount = 3;
]

Of course only one site can match: the one whose SE stores inputfile.dat.

[morgan@localhost JOB_WITH_INPUT]$ glite-wms-job-list-match -a -e https://egee-wms-01.cnaf.infn.it:7443/glite_wms_wmproxy_server exec.jdl 

Connecting to the service https://egee-wms-01.cnaf.infn.it:7443/glite_wms_wmproxy_server

==========================================================================

                     COMPUTING ELEMENT IDs LIST 
 The following CE(s) matching your job requirements have been found:

        *CEId*
 - gridce.ilc.cnr.it:2119/jobmanager-lcgpbs-grid

==========================================================================

[morgan@localhost JOB_WITH_INPUT]$ glite-wms-job-submit -a -e https://egee-wms-01.cnaf.infn.it:7443/glite_wms_wmproxy_server exec.jdl 

Connecting to the service https://egee-wms-01.cnaf.infn.it:7443/glite_wms_wmproxy_server


====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://albalonga.cnaf.infn.it:9000/2sRZ8ILUYNl8e2TYcQBKfA

==========================================================================


[morgan@localhost JOB_WITH_INPUT]$ glite-job-status https://albalonga.cnaf.infn.it:9000/2sRZ8ILUYNl8e2TYcQBKfA


*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://albalonga.cnaf.infn.it:9000/2sRZ8ILUYNl8e2TYcQBKfA
Current Status:     Scheduled 
Status Reason:      Job successfully submitted to Globus
Destination:        gridce.ilc.cnr.it:2119/jobmanager-lcgpbs-grid
Submitted:          Thu Aug 21 15:23:42 2008 CEST
*************************************************************

Using Gridftp access to/from sandbox

Important advantages provided by WMProxy include the possibility to put into the InputSandbox files located on a GridFTP server, as well as the possibility to upload job outputs automatically to a GridFTP server. This feature can be exploited by all kinds of jobs submitted to WMProxy.

For example you can use a file uploaded to an SE as an input file:

[
  Type = "Job";
  JobType = "Normal";
  Executable = "exec.sh";
  Arguments = "-i filea5f5a851-8cec-4494-9afb-0c59e0625380 -d grid002.oats.inaf.it -p test";
  InputSandbox = {
                 "gsiftp://grid002.oats.inaf.it/dpm/ct.infn.it/home/planck/generated/2008-05-26/filea5f5a851-8cec-4494-9afb-0c59e0625380",
                 "exec.x", "exec.sh"
               };
  StdOutput = "output.txt";
  StdError = "error.txt";
  OutputSandbox = {"output.txt","error.txt"};
  ShallowRetryCount = 1;
]

It is also possible to copy the whole output sandbox to an SE:

[
  Type = "Job";
  JobType = "Normal";
  Executable = "exec.sh";
  Arguments = "-i filea5f5a851-8cec-4494-9afb-0c59e0625380 -d grid002.oats.inaf.it -p test";
  InputSandbox = {
                 "gsiftp://grid002.oats.inaf.it/dpm/ct.infn.it/home/planck/generated/2008-05-26/filea5f5a851-8cec-4494-9afb-0c59e0625380",
                 "exec.x", "exec.sh"
               };
  StdOutput = "output.txt";
  StdError = "error.txt";
  OutputSandboxBaseDestURI = "gsiftp://grid002.oats.inaf.it/dpm/ct.infn.it/home/planck/generated/2008-05-26/";
  OutputSandbox = {"output.txt","error.txt"
                               };
  ShallowRetryCount = 1;
]

Jobs with Requirements

The Requirements attribute allows you to add constraints on the computing resources; for instance, you can select a specific resource, or one satisfying certain software requirements.

In the first example, the job will run on one of the 4 queues listed in the Requirements attribute; notice the logical operator ||, which performs a logical OR:

Type="Job"; 
JobType="Normal"; 
Executable = "exec.sh"; 
Arguments = "-i inputfile.dat -s gridse.ilc.cnr.it"; 
StdError = "stderr.log"; 
StdOutput = "stdout.log"; 
InputSandbox = {"exec.sh", "exec.c", "inputfile.dat"}; 
OutputSandbox = {"stderr.log", "stdout.log"}; 
Requirements = (
other.GlueCEUniqueID == "hepgrid2.ph.liv.ac.uk:2119/jobmanager-lcgpbs-planck"
 || other.GlueCEUniqueID == "ce01.esac.esa.int:2119/jobmanager-lcgpbs-bpg"
 || other.GlueCEUniqueID == "fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-planck"
 || other.GlueCEUniqueID == "gridce.pi.infn.it:2119/jobmanager-lcglsf-grid4"
);

If we now check the list of matching resources, only the selected ones appear:

[morgan@localhost JOB_WITH_REQ]$ glite-wms-job-list-match -a -e https://egee-wms-01.cnaf.infn.it:7443/glite_wms_wmproxy_server exec.jdl 

Connecting to the service https://egee-wms-01.cnaf.infn.it:7443/glite_wms_wmproxy_server

==========================================================================

                     COMPUTING ELEMENT IDs LIST 
 The following CE(s) matching your job requirements have been found:

        *CEId*
 - ce01.esac.esa.int:2119/jobmanager-lcgpbs-bpg
 - fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-planck
 - gridce.pi.infn.it:2119/jobmanager-lcglsf-grid4
 - hepgrid2.ph.liv.ac.uk:2119/jobmanager-lcgpbs-planck

==========================================================================

Requirements are also very useful for excluding a resource. If a job fails when submitted to one site but succeeds on another, we can force the WMS to ignore the failing site by adding to the JDL a line like:

Requirements = !(RegExp("a.b.institute.country",other.GlueCEUniqueID));

It is also in general good practice to ask the WMS to resubmit a job if a failure happens before the job reaches a Worker Node. This is done with, for example:

ShallowRetryCount = 1;

Typical values are 1, 2, or 3.

Another use of the Requirements attribute is to select a site that supports a particular piece of software (i.e. the software is installed on the CE and WNs), advertised by the site as a tag in GlueHostApplicationSoftwareRunTimeEnvironment. Suppose for example that the required tag is GEANT4-6; the JDL file will be:

$ cat hostnamereq2.jdl
[
Type="Job"; 
JobType="Normal"; 
Executable = "exec.sh"; 
Arguments = "-i inputfile.dat -s gridse.ilc.cnr.it"; 
StdError = "stderr.log"; 
StdOutput = "stdout.log"; 
InputSandbox = {"exec.sh", "exec.c", "inputfile.dat"}; 
OutputSandbox = {"stderr.log", "stdout.log"}; 
Requirements =  ( RegExp("inaf.it",other.GlueCEUniqueId)
                  &&
                  Member("GEANT4-6",other.GlueHostApplicationSoftwareRunTimeEnvironment));
ShallowRetryCount = 3;
]

The RegExp("inaf.it",other.GlueCEUniqueId) specify that the site where the job will run must be within the domain inaf.it. In this case we are using a Regular Expression.

Ranking the resources

The Rank attribute is a Floating-Point expression that states how to rank CEs that have already met the Requirements expression. Essentially, rank expresses a preference. A higher numeric value equals a better rank.

For example, the following JDL ranks the eligible CEs by the number of free CPUs:

Type="Job"; 
JobType="Normal"; 
Executable = "exec.sh"; 
Arguments = "-i inputfile.dat -s gridse.ilc.cnr.it"; 
StdError = "stderr.log"; 
StdOutput = "stdout.log"; 
InputSandbox = {"exec.sh", "exec.c", "inputfile.dat"}; 
OutputSandbox = {"stderr.log", "stdout.log"}; 
Requirements = (
other.GlueCEUniqueID == "hepgrid2.ph.liv.ac.uk:2119/jobmanager-lcgpbs-planck"
 || other.GlueCEUniqueID == "ce01.esac.esa.int:2119/jobmanager-lcgpbs-bpg"
 || other.GlueCEUniqueID == "fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-planck"
 || other.GlueCEUniqueID == "gridce.pi.infn.it:2119/jobmanager-lcglsf-grid4"
);
Rank = other.GlueCEStateFreeCPUs;

Perform the job list match:

[morgan@localhost JOB_WITH_REQ]$ glite-wms-job-list-match -a -e https://egee-wms-01.cnaf.infn.it:7443/glite_wms_wmproxy_server exec.jdl 

Connecting to the service https://egee-wms-01.cnaf.infn.it:7443/glite_wms_wmproxy_server

==========================================================================

                     COMPUTING ELEMENT IDs LIST 
 The following CE(s) matching your job requirements have been found:

        *CEId*
 - hepgrid2.ph.liv.ac.uk:2119/jobmanager-lcgpbs-planck
 - fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-planck
 - ce01.esac.esa.int:2119/jobmanager-lcgpbs-bpg
 - gridce.pi.infn.it:2119/jobmanager-lcglsf-grid4

==========================================================================

As you can see, the order is different from the one obtained above: the first CE, to which the job will be sent for submission, is the one with the highest number of free CPUs.

Special Jobs

Job COLLECTIONS

A job collection is a set of mutually independent jobs which, for some reason known to the user, need to be submitted, monitored and controlled as a single request. A good reason could be that the sub-jobs have common input files: in fact WMProxy allows the sharing and inheritance of sandboxes, and optimizes network traffic by transferring a single copy of each file even when it is used by several sub-jobs.

[
  Type = "collection";
  InputSandbox = { "input_common1.dat",  "input_common2.dat"  };
 
  nodes = {
                   [
                        JobType = "Normal";
                        NodeName = "node1";
                        Executable = "exec.sh";
                        Arguments = "-i input_common1.dat -d gridse.ilc.cnr.it -p node1";
                        InputSandbox = {"exec.sh", "exec.x", root.InputSandbox[0] };
                        StdOutput = "stdout.1";
                        StdError  = "stderr1";
                        OutputSandbox = {"stdout.1","stderr.1"};
                        ShallowRetryCount = 1;
                   ],
                  [
                        JobType = "Normal";
                        NodeName = "node2";
                        Executable = "exec.sh";
                        InputSandbox = {"exec.sh", "exec.x",  root.InputSandbox[1] };
                        Arguments = "-i input_common2.dat -d gridse.ilc.cnr.it -p node2";
                        StdOutput = "stdout.2";
                        StdError  = "stderr.2";
                        OutputSandbox = {"stdout.2","stderr.2"};
                       ShallowRetryCount = 1;
                   ],
                   [
                        JobType = "Normal";
                        NodeName = "node3";
                        Executable = "/bin/cat";
                        InputSandbox = {root.InputSandbox};
                        Arguments = "*.dat";
                        StdOutput = "stdout.3";
                        StdError  = "stderr.3";
                        OutputSandbox = {"stdout.3","stderr.3"};
                       ShallowRetryCount = 1;
                      ]
           };
 Requirements  = (other.GlueCEStateFreeCPUs>1)
]

In this example, three jobs are run. There is a common InputSandbox, from which node1 and node2 inherit one file each and process it with their own script, while node3 inherits the full InputSandbox and prints the content of both files (it runs /bin/cat on every .dat file in its working directory). The executable exec.sh is:

[morgan@localhost JOB_PARAMETRIC]$ cat exec.sh
#!/bin/bash

# set global LFC variables on your service

export LCG_GFAL_INFOSYS=egee-bdii.cnaf.infn.it:2170  # define the BDII endpoint
export LCG_CATALOG_TYPE=lfc                # define the catalog type
export LFC_HOST=lfcserver.cnaf.infn.it         # define the catalog endpoint
EXIT_VALUE=0


help () {
echo "Usage: $0 -i inputfile -d storageelement -p gridpath"
exit 0
}


# parse the command line; -d and -p take arguments, hence "d:" and "p:"
while getopts "p:d:hi:" flag
do
  if [ "$flag" == i ]; then
          inputfile=$OPTARG
  elif [ "$flag" == h ]; then
        help
  elif [ "$flag" == d ]; then
        storageelement=$OPTARG
  elif [ "$flag" == p ]; then
        GRID_PATH=$OPTARG
  fi
done

# check if the WN has a valid default SE, otherwise use the user-supplied one
if [ "XXX${VO_PLANCK_DEFAULT_SE}" != "XXX" ]; then
        ping -c 2 ${VO_PLANCK_DEFAULT_SE}
        # fall back to the user-supplied SE if the default one does not respond
        if [ $? -ne 0 ]; then
                export VO_PLANCK_DEFAULT_SE=$storageelement
        fi
else
        export VO_PLANCK_DEFAULT_SE=$storageelement
fi

chmod 700 $PWD/exec.x
$PWD/exec.x -i $inputfile
if [ $? -eq 0 ]; then
        # register every non-empty output file under the node-specific LFC path
        for ((i=1; i<=3; i++))
        do
                if [ -s output$i.dat ]; then
                        echo -n "Saving file output$i.dat on LFC  "
                        lcg-cr --vo planck -d ${VO_PLANCK_DEFAULT_SE} -l lfn:/grid/planck/morgan/${GRID_PATH}/output${i}.dat file:$PWD/output${i}.dat
                        if [ $? -ne 0 ]; then
                                EXIT_VALUE=255
                        fi
                        echo done.
                fi
        done
fi

exit $EXIT_VALUE

Submit and monitor the collection; it is easier if you save the job identifiers in a file (jdl.list in this case):

[morgan@localhost JOB_PARAMETRIC]$ glite-wms-job-submit -a -e https://egee-wms-01.cnaf.infn.it:7443/glite_wms_wmproxy_server -o jdl.list collection1.jdl 

Connecting to the service https://egee-wms-01.cnaf.infn.it:7443/glite_wms_wmproxy_server


====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://albalonga.cnaf.infn.it:9000/tHNNv5WFoeVdLD4NbEqmSQ

The job identifier has been saved in the following file:
/home/morgan/TUTORIAL/WMS/JOB_PARAMETRIC/jdl.list

==========================================================================


[morgan@localhost JOB_PARAMETRIC]$ glite-wms-job-status -i /home/morgan/TUTORIAL/WMS/JOB_PARAMETRIC/jdl.list


*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://albalonga.cnaf.infn.it:9000/tHNNv5WFoeVdLD4NbEqmSQ
Current Status:     Waiting 
Submitted:          Thu Aug 21 16:59:04 2008 CEST
*************************************************************

- Nodes information for: 
    Status info for the Job : https://albalonga.cnaf.infn.it:9000/7rsjiEh2O8HHb0MTKHehUw
    Current Status:     Ready 
    Status Reason:      unavailable
    Destination:        gridba2.ba.infn.it:2119/jobmanager-lcgpbs-infinite
    Submitted:          Thu Aug 21 16:59:04 2008 CEST
*************************************************************
    
    Status info for the Job : https://albalonga.cnaf.infn.it:9000/9ALrN_nh9xHv8XkBvTm3JQ
    Current Status:     Ready 
    Status Reason:      unavailable
    Destination:        heplnx206.pp.rl.ac.uk:2119/jobmanager-lcgpbs-planck
    Submitted:          Thu Aug 21 16:59:04 2008 CEST
*************************************************************
    
    Status info for the Job : https://albalonga.cnaf.infn.it:9000/nvVROd4nnFDF4SU0afcBXA
    Current Status:     Ready 
    Status Reason:      unavailable
    Destination:        grid012.ct.infn.it:2119/jobmanager-lcglsf-short
    Submitted:          Thu Aug 21 16:59:04 2008 CEST
*************************************************************

Note that the node names are the ones specified in the JDL file. When all the jobs have finished, download and verify the job output. The following example uses --dir to create a new directory for the output:

glite-wms-job-output --dir ./myOp  -i jobId

Parametric jobs

A parametric job causes a set of jobs to be generated from one JDL file. This is invaluable when many similar (but not identical) jobs must be run. It works by giving one or more attributes in the JDL a parametric value, identified by the keyword _PARAM_; that keyword is replaced by the actual value of the parameter during JDL expansion. The JobType of such a job is Parametric.

In this example the set of values of the parameter is defined by three attributes:

  • Parameters: the value at which the generation of jobs stops (that value itself is not used);
  • ParameterStep: the step between consecutive parameter values;
  • ParameterStart: the starting value of the parameter.

The number of jobs is (Parameters - ParameterStart) / ParameterStep; in the example below this gives (5 - 1) / 2 = 2 jobs.

[
 JobType = "Parametric";
 Parameters= 5;
 ParameterStep =2;
 ParameterStart = 1;
 Executable = "exec.sh";
 Arguments = "-i inputfile_PARAM_.dat -d gridse.ilc.cnr.it -p node_PARAM_";
 InputSandbox = {"exec.sh", "exec.x",inputfile_PARAM_.dat };
 StdOutput = "stdout._PARAM_";
 StdError  = "stderr._PARAM_";
 OutputSandbox = {"stdout._PARAM_","stderr._PARAM_"};
 ShallowRetryCount = 1;
 Requirements  = (other.GlueCEStateFreeCPUs>1)
]

In this case, 2 jobs will be generated, executing the program with two inputs: inputfile1.dat and inputfile3.dat (the final value of Parameters is excluded). Note that the coherence of the InputSandbox is up to the user.
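Before submitting, make sure the expanded file names actually exist in the working directory (illustrative; template.dat is hypothetical and how you produce the inputs is up to you):

for i in 1 3; do cp template.dat inputfile${i}.dat; done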

[morgan@localhost JOB_PARAMETRIC]$ glite-wms-job-submit -a -e https://egee-wms-01.cnaf.infn.it:7443/glite_wms_wmproxy_server -o jdl.list parametric.jdl 

Connecting to the service https://egee-wms-01.cnaf.infn.it:7443/glite_wms_wmproxy_server


====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://albalonga.cnaf.infn.it:9000/QXHLhrGR1HttPBt4GI6jDw

The job identifier has been saved in the following file:
/home/morgan/TUTORIAL/WMS/JOB_PARAMETRIC/jdl.list

==========================================================================

[morgan@localhost JOB_PARAMETRIC]$ glite-job-status -i jdl.list 

*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://albalonga.cnaf.infn.it:9000/QXHLhrGR1HttPBt4GI6jDw
Current Status:     Waiting 
Submitted:          Thu Aug 21 17:52:24 2008 CEST
*************************************************************

- Nodes information for: 
    Status info for the Job : https://albalonga.cnaf.infn.it:9000/QbTrIN5vT5U-a32xF6eOXQ
    Current Status:     Waiting 
    Status Reason:      unavailable
    Destination:        ce01.ariagni.hellasgrid.gr:2119/jobmanager-pbs-planck
    Submitted:          Thu Aug 21 17:52:24 2008 CEST
*************************************************************
    
    Status info for the Job : https://albalonga.cnaf.infn.it:9000/hIuewvF17PUlBPga9cLizQ
    Current Status:     Scheduled 
    Status Reason:      Job successfully submitted to Globus
    Destination:        grid10.lal.in2p3.fr:2119/jobmanager-pbs-planck
    Submitted:          Thu Aug 21 17:52:24 2008 CEST
*************************************************************

When the jobs have completed, download the outputs: each sub-job's files end up in a different subdirectory, with the parametrised names.
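As with collections, a single command retrieves the whole set (a sketch):

glite-wms-job-output --dir ./myOp -i jdl.list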

DAG jobs

A DAG (directed acyclic graph) represents a set of jobs where the input, the output or the execution of one or more jobs depends on one or more other jobs. The jobs are nodes (vertices) in the graph and the edges (arcs) represent the dependencies. For example:

[
  Type = "dag";
  InputSandbox = { "exec.sh", "exec.c" };
  nodes = [
          father = [
                   description = [
                                 JobType = "Normal";
                                 Executable = "exec.sh";
                                 Arguments = "-i inputfile1.dat -d gridse.ilc.cnr.it -p father";
                                 InputSandbox = {root.InputSandbox, "inputfile1.dat"};
                                 StdOutput = "father.output";
                                 StdError  = "father.error";
                                 OutputSandbox = {"father.output","father.error","output1.dat","output2.dat"};
                                 ShallowRetryCount = 1;
                                 ];
                   ];
          son1 =   [
                   description = [
                                 JobType = "Normal";
                                 Executable = "exec.sh";
                                 InputSandbox = {root.InputSandbox,root.nodes.father.description.OutputSandbox[2]};
                                 Arguments = "-i output1.dat -d gridse.ilc.cnr.it -p son1";
                                 StdOutput = "son1.output";
                                 StdError  = "son1.error";
                                 OutputSandbox = {"output1.dat","son1.output","son1.error"};
                                 ShallowRetryCount = 1;
                                 ];
                   ];
          son2  =  [
                   description = [
                                 JobType = "Normal";
                                 Executable = "/bin/sh";
                                 InputSandbox = {root.InputSandbox,root.nodes.father.description.OutputSandbox[3]};
                                 Arguments = "-i output2.dat -d gridse.ilc.cnr.it -p son1";
                                 StdOutput = "son2.output";
                                 StdError  = "son2.error";
                                 OutputSandbox = {"output2.dat","son2.output","son2.error"};
                                 ShallowRetryCount = 1;
                                 ];
                                 ];
          final =  [
                   description = [
                                 JobType = "Normal";
                                 Executable = "/bin/cat";
                                 InputSandbox = {"root.nodes.son1.description.OutputSandbox[0],root.nodes.son2.description.OutputSandbox[0]};
                                 Arguments = "output1.dat output2.dat";
                                 StdOutput = "dag.out";
                                 StdError  = "dag.err";
                                 OutputSandbox = {"dag.out","dag.err"};
                                 ShallowRetryCount = 1;
                                 ];
                    ];
   dependencies =  {
       {father,{son1,son2}},
       {son1,final}, {son2,final}
     };
 ];
]

You may note several things:

  • There can be a set of common attributes, valid for all of the sub-jobs. In this example a common InputSandbox is defined.

  • The "dependencies", which determine the order of execution, are defined as pairs of type {A,B}, stating that B cannot be executed before A has finished. Each element can also be a set itself, as in the first pair, which is useful to shorten the notation: it could have been written as {father,son1},{father,son2}.

  • Files are passed between nodes using the notation seen above for collections: in this example son1 and son2 inherit a different output file from father, and they produce two different outputs which are then inherited by the final node.

MPI Jobs

MPI jobs run in parallel on several nodes. At the JDL level they differ from other jobs by a dedicated JobType ("MPICH") and by the NodeNumber attribute, which specifies how many nodes are required to run the job. This example JDL assumes you have the MPI executable cpi in the same directory as the JDL file.

[
 Type = "Job";
 JobType = "MPICH";

 Executable = "cpi";
 NodeNumber = 2;

 StdOutput = "cpi.out";
 StdError = "cpi.err";

 InputSandbox = {"cpi"};
 OutputSandbox = {"cpi.err","cpi.out"};

 RetryCount = 0;
]
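Submission and monitoring work exactly as for normal jobs, for example (a sketch, assuming the JDL above is saved as cpi.jdl):

glite-wms-job-submit -d $USER -o mpi.jobid cpi.jdl
glite-wms-job-status -i mpi.jobid

Only CEs that can provide at least NodeNumber CPUs should match such a job.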

References

WM proxy user guide

JDL Attributes guide for WM proxy

Datamat -- WM proxy quickstart

-- TaffoniGiuliano - 20 Aug 2008
