SGE

From MerlinWiki
Jump to: navigation, search

This page is intended to share the experience with using SGE at FIT. At the moment, we are using open-source version Grid Engine 6.2.

Basic info on SGE is here: http://www.fit.vutbr.cz/CVT/cluster/sge.php (mainly overview of command line tools and their parameters, here you'll find mainly the rules and more general info)

Specific info for storage (for Speech@FIT members) is at Speech Wiki:Storage Policy

Some more info in English also here (but do not take it literally!): http://biowiki.org/HowToUseSunGridEngine, http://arc.liv.ac.uk/SGE/howto/

Usage stats: https://merlin.fit.vutbr.cz/sge-stats and https://merlin.fit.vutbr.cz/ssd-cache/

Getting SGE account

In case you're a new user and you'd like to get the access to SGE, the following steps are obligatory:

  • Study CAREFULLY the following documents: http://merlin.fit.vutbr.cz/wiki/index.php?title=SGE, http://www.fit.vutbr.cz/CVT/cluster/sge.php
  • prepare example scripts that you'll want to run (you will have to present these to the admin so that he can check that you really know what and how you want to do. )
  • make appointment with some of existing SGE users (optimally someone that is doing a work similar to yours), that will be your mentor and will be responsible for what you're doing in SGE (i.e. someone who knows SGE well and is able to help you).

At the end, make an appointment with the admin mailto:linux@fit.vutbr.cz with the above mentioned materials.

I am a beginner

  • Read this page carefully ! You will be asked questions about it.
  • In case you write a job to be submitted to SGE, it is necessary to first test locally if it really does what it should do.
  • never use your $HOME (on kazi or eva) for anything related with SGE (data, scripts etc.), overloading these servers will make a lot of people angry!
  • Estimate, how much memory the job will need (for example by running top in other terminal) - and set resources, for example -l ram_free=500M,mem_free=500M (the values are the ones that we have seen in top output + some margin). Attention to the changes of needed memory during the computation, or wih different input data !
  • Estimate how much the process loads disks (writes are more expensive than reads), - and set -l disk_server=X according to the disk server on which you're working. There can be more such servers, for examples if I read from matylda8 and in the same time I write to scratch10, I have to specify both: -l matylda8=1,scratch10=10
  • Estimate, how long the process will run - tasks of units of minutes (15 - 30), are optimal. Too short tasks are bad for SGE scheduler, too long tasks take too much processing power and prevent equal distribution of resources between users. Example: there is noone in SGE, so I will run tasks that are longer than 3 hours on all computers. In case someone submits to SGE after you do this, he/she will have to wait until my tasks are finished. Now imagine you're the other user - he/she will have to wait 3 hours until all my tasks are finished. Therefore, it is necessary to submit tasks of reasonable lengths if possible or use the proper method for longer jobs.
  • In case you have a task that needs to be distributed on several machines, do not divide it based on the amount computers but based on the total duration the task would take. How to do this ? I estimate, how long a part of it takes, for example processing of 10 files. In case I find out, that this fulfills the condition above (i.e. it takes about 20 minutes), I will prepare X tasks (X is arbitrary) where each task will process 10 files.
  • Which computers should/can I use ? Most of the time stable, but this can be changed according to your needs, for example if you need more computers, you'll want to go also to unstable ones...
  • If it is possible, work with local /tmp on each computer: for example, I process one file and during my computation, I create several other ones until I reach the final result that I need to store. The middle-products should be stored to /tmp/$USER which you have to delete at the end of the script. This way does not put useless load on the disk servers and the network. It is necessary to use -l tmp_free,disk_free. Files in /tmp, in case they are not touched for a month, are automatically deleted !
  • In scripts, it is good to check if you're not computing again some thing you've already computed. For example, check if a file is already there, if yes, skip it.
  • Regarding the load generated, it is not allowed in any case to run computations on servers (for example merlin). For tuning of your tasks, use employees' computers or computers in the PC labs. The servers (merlin) can be used only in case the tuning does not require extra resources, i.e. run time not longer than 10min (of course with minimum priority - nice), occupied RAM less than 100-150MB and data loaded and stored to networked storage (matylda's, /pub/tmp, /tmp) less than ~50-100MB.

Auxiliary tools

  • qstat-no - summary of info on the usage of SGE, available-free nodes, and status of file-servers.
  • sge.log - alias for tail for visualization of SGE log. Use in case your task crashes or ends in E state.
  • sge.cat-job - without parameters, lists ID of all jobs actually in SGE. With job ID, it prints that job's script the way how SGE sees it.
  • sge-resources - info about resource usage per user and per resource
  • sge-long - info about long jobs per user

Limitation of access

Blade servers have direct access limited. In case of problems, they are accessible from svatava (ulrika), under normal circumstances, they are not (as well as svatava, ulrika) intended for interactive mode - use SGE to run jobs on them. The use of SGE is limited only to activities in line with FIT orientation (bachelor/diploma projects, research, etc.) and you have to apply for it.

Use of priorities

In the new version of SGE, it is possible to use two modes of job prioritization.

  • using parameters -p NUM, where NUM is a number -1 .. -1023, it is possible to lower the priority of a job within all jobs of all users waiting in the queue.
  • in addition to this, it is possible to use parameter -js NUM (job shares), with the value of NUM from 0 to 100. This will set the priority of jobs from one user (your jobs) one to another.

Jobs with the same priority are processed according to the time they came to the queue (i.e. FIFO). The default is -p 0 and -js 0.

Some hints

In case you need to resolve dependencies, use qsub/qalter -hold_jid WJID (xxxx for qsub/JID for qalter) where WJID is ID of the job that is waited for (the job JID will be holded-off until WJID is finished).

Jobs longer than 3 hours

There are two queues in SGE. The default one (all.q) is for short jobs, has no limitations on number of machines or slots, but each job is limited to 4 hours of real time (not CPU time!). If you need longer job, you have to use long.q queue which has time limit set to 14 days. Even longer jobs need to be discussed individually.

To apply for long jobs using long.q queue use qsub -q long.q@@stable instead of all.q@@stable or long.q@@speech instead of all.q@@speech. The general format is QUEUE@HOST_DEFINITION.

Currently there can be up to 150 long jobs, but only 40 per user. All machines are in both queues, but this may change in future.

It is strongly recommended to add a limitation of the duration of a job run-time to the start of your script: ulimit -t XX, where XX is estimated run-time in seconds - upper bound, i.e. 3 hours (10800). It is useful to use the same even for really long jobs --- if you expect the job to take 3 days to finish, it does not need to stay there for a month in case something goes wrong and the job loops forever.

Interactive work

In case you need to run interactive programs (such as Matlab, etc.) do not log on directly using ssh, but use qlogin. This will log you on one of free nodes and run a terminal thereon. The advantage is that this login is logged as a special SGE job, no other jobs are submitted to the slot you're logged on and there is no risk of overloading the node. However, after you finish, you MUST log off so that the node is freed for normal (non-interactive) jobs! Currently, such access to blades is limited only from servers svatava a ulrika, it is therefore necessary to run qlogin from these servers (if you have blades in your list of queues)! You are allowed to run only one process at a time this way, no parallel processing is possible as you have dedicated one slot on a machine!

Implicitly, in case qlogin is used and there are no free resources, the program will finish in relatively short time. In case you do want to wait for a free slot, use

qlogin -now n

If you plan to use Matlab in your interactive session, read section MATLAB on SGE below for proper CPU handling within matlab.

Reserving resources (RAM, disc, GPU, SSD)

  • matyldaX
  • scratchX
  • ram_free, mem_free
  • disk_free, tmp_free
  • gpu, gpu_ram, ssd

You have to specify all requested resources your job would like to use in SGE. It is specially important in case an excessive use of RAM/netowrk storage is expected. The limits are soft and hard (parameters -soft, -hard), the limits themselves are:

 -l resource=value

For example, in case a job needs at least 400MB RAM: qsub -l ram_free=400M my_script.sh Another often requested resource is the space in /tmp: qsub -l tmp_free=10G my_script.sh. Or both:

qsub -l ram_free=400M,tmp_free=10G my_script.sh

Of course, it is possible (and preferable if the number does not change) to use the construction #$ -l ram_free=400M directly in the script. The actual status of given resource on all nodes can be obtained by: qstat -F ram_free, or more things by: qstat -F ram_free,tmp_free.

Details on other standard available resources are in /usr/local/share/SGE/doc/load_parameters.asc. In case you do not specify value for given resource, implicit value will be used (for space on /tmp it is 1GB, for RAM 100MB)

WARNING: You need to distinguish, if you request resources that are available at the time of submission (so called non-consumable resources), or if you need to allocate given resource for the whole runtime of your computation - for example, your program will need 400MB of memory but in the first 10 min of computation, it will allocate only 100MB. In case you use the standard resource mem_free, and during the first 10min another jobs will be submitted to the given node, SGE will interpret it in the following way: you wanted 400MB but you finally use only 100MB so that the rest of 300MB will be given to someone else (i.e. it will submit another task requesting this memory).

For these purposes, it is better to use consumable resources, that are computed independently on the current status of the task - for memory it is ram_free, for disc tmp_free. For example, resource ram_free does not look at the actual free RAM, but it computes the occupation of RAM only based on the requests of individual scripts. It works with the size of RAM of the given machine and subtracts the amount requested by the job that should be run on this machine. In case the job does not specify ram_free, implicit value of ram_free=100M will be used.

For the disk space in /tmp (tmp_free), the situation is more tricky: in case a job does not clean up properly its mess after it finishes, the disk can actually have less space than defined by the resource. Unfortunately, nothing can be done about this.

To ask for a machine with a local SSD disk (fast) in /mnt/ssd use

-l ssd=1

To ask for a gpu machine with specific size of GPU RAM use

-l gpu=1,gpu_ram=4G

Known problems with SGE

  • Use of paths - for home directory it is necessary to use the official path - i.e. /homes/kazi/... or /homes/eva (or simply the variable $HOME). In case the path of the internal mountpoint of the automounter is used - i.e. - /var/mnt/... an error will occur. (this is not an error of SGE, the internal path is not fully functional for access)
  • Availability of nodes - due to the existence of nodes with limited access (employees' PCs), it is necessary to specify a list of nodes, on which your job can run. This can be done using parameter -q. The machines that are available are nodes in IBM Blades and also some computer labs in case you turn the machines on over night. The list of queues for -q must be only on one line even if it is very long. For the availability of given groups of nodes, the parameter -q can be used in the following way:
#$ -q all.q@@blade,all.q@@PCNxxx,all.q@@servers

Main groups of computers are: @blade, @servers, @speech, @PCNxxx, @PCN2xxx - the full and actual list can be obtained by qconf -shgrpl

  • The syntax for access is QUEUE@OBJECT - i.e. all.q@OBJECT. The object is either one computer, for example all.q@svatava, or a group of computers (which begins also by @ - @blade) i.e. all.q@@blade.
  • The computers in the labs are sometimes restarted by students during computation - we can't do much about this. In case you really need the computation to finish (i.e. it is not easy to re-run a job in case it is brutally killed) use newly defined groups of computers:
@stable - @blade, @servers - servers that run all the time w/o restarting
@gpu - machines with GPUs for CUDA computing
@PCOxxx, @PCNxxx - computer labs, there is a possibility that any node might be restarted at any time,
      a student or someone can shut the machine down by error or "by error". It is more or less sure that these
      machines will run smoothly over night and during weekends. There is also a group for each independent lab e.g. @PCN103.
  • for GPU jobs, use -lgpu=1 (or more if you use more GPUs on one machine) and @gpu host group.
  • Runnnig other scripts than bash - it is necessary to specify the interpret on the first line of your script (it is probably already there), for example #!/usr/bin/perl, etc.
  • Does your script generate a heavy traffic on matyldas ? It is necessary to set -l matyldaX=10, (for example 10 - i.e. in total 100/10 = 10 concurrent jobs from given matyldaX), where X is the number of matylda used (in case you use several matyldas, specify -l matyldaX=Y several times). We have created an SGE resource for each matylda (each matylda has 100 points in total) and the jobs using -l matyldaX=Y are submitted until given matylda has free points. This can be used to balance the load of given storage server from the user side. The same holds for servers scratch0X.
  • Attention to parameter -cwd, is is not guaranteed that it will work all the time, better use cd /where/do/i/want at the beginning of your script.
  • In case a node is restarted, a job will still be shown in SGE, although it is not running any more. This is because SGE is waiting until the node confirms termination of the computation (i.e. until it boots Linux again and starts the SGE client). In case you use qdel to delete a job, it will be only marked by flag d. Jobs marked by this flag are automatically deleted by the server every hour.

Parallel jobs - OpenMP

For parallel tasks with threads, it is enough to use parallel environment smp and to set the number of threads:

#!/bin/sh 
#
#$ -N OpenMPjob
#$ -o $JOB_NAME.$JOB_ID.out
#$ -e $JOB_NAME.$JOB_ID.err
#
# PE_name    CPU_Numbers_requested
#$ -pe smp  4
#
cd SOME_DIR_WITH_YOUR_PROGRAM
export OMP_NUM_THREADS=$NSLOTS
 
./your_openmp_program [options]

Parallel jobs - OpenMPI

  • Open MPI is now fully supported, and it is the default parallel environment (mpirun is by default Open MPI)
  • The SGE parallel environment is openmpi
  • The allocation rule is $fill_in$ which means that the preferred allocation is on the same machine.
  • Open MPI is compiled with tight SGE integration:
    • mpirun will automatically submit to machines reserved by SGE
    • qdel will automatically clean all MPI stubs
  • In the parallel task, do not forget (preferably directly in the script) to use parameter -R y, this will turn on the reservation of slots, i.e. you won't be jumped by processes requesting less slots.
  • in case a parallel task is launched using qlogin, there is no variable containing information on what slots were reserved. A useful tool is then qstat -u `whoami` -g t | grep QLOGIN, which says what parallel jobs are running.

Listing follows:

#!/bin/bash
# ---------------------------
# our name 
#$ -N MPI_Job
#
# use reservation to stop starvation
#$ -R y
#
# pe request
#$ -pe openmpi 2-4
#
# ---------------------------
# 
#   $NSLOTS          
#       the number of tasks to be used

echo "Got $NSLOTS slots."

mpirun -n $NSLOTS /full/path/to/your/executable

MATLAB on SGE

From version 2008, MATLAB implicitly runs on multiple cores (e.g. on Welda's, 32 cores are used by implicit). This behavior is not compatible with the SGE policy, which reserves one core for one slot. Therefore you are required either to make MATLAB run on one core as described above, or reserve multiple slots.

For Matlab 2009 and newer start matlab as:

matlab -singleCompThread