Using the cluster
Panthera overview
Since the cluster is shared among all users, it needs both a resource manager and a queuing system; on our cluster this role is filled by the Slurm program.
Resource allocation in a cluster is based on a queuing mechanism: if free resources are available, your job runs immediately; otherwise it waits in the queue until resources become free.
Tip
The base operating system of our cluster, like that of most clusters, is Linux. As a rule, the software you want to work with should have a Linux version so that you can run it on the cluster.
Info
For more information about the Panthera architecture and its hardware specifications, please visit this page.
Job types on Panthera
Before describing the types of jobs, it is worth saying what kinds of workloads a cluster is useful for:
- Software written to run in parallel, so that it can use several cores of a single compute node or even multiple nodes (depending on the size of the calculation) to speed up your computations.
- Workloads that need more resources than your personal system provides.
- Serial programs that must be executed many times (with various inputs); in this case you can submit several jobs to the cluster and run them at the same time.
Generally, there are two strategies for running programs in parallel:
- shared memory
- distributed memory
The best-known implementations of these models are OpenMP (shared memory) and MPI (distributed memory). With OpenMP, your program can use at most the cores of a single system and cannot use the processors of several systems at the same time. In the distributed memory model, your program can engage the processors of several nodes simultaneously.
Note
Using more cores does not necessarily mean faster program execution.
Every parallel program has an optimal core count: up to a certain point you see a near-linear speed increase, but beyond it the gain flattens out, and adding even more cores may actually make your runs slower.
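This behavior is commonly described by Amdahl's law: if a fraction p of a program can be parallelized, the maximum speedup on n cores is 1 / ((1 − p) + p/n). For example, with p = 0.95, 64 cores give a speedup of at most about 15x, and no number of cores can push it beyond 20x.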
In general, jobs are executed in two ways on the cluster:
- Interactive: use this type of job when you need to compile code before running your program, download libraries from the Internet, or do an initial test of your code.
- Non-interactive (batch)
It is important to mention that the resources you request for an interactive job must be available at that very moment; since cluster resources are limited, interactive jobs are only granted a small number of resources for a short time.
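As a minimal sketch, one common way to start an interactive job is the srun command with the --pty option (adjust the resources and time limit to your needs):
u111111@login1:~> srun -c 1 --mem=2G --time=30 --pty bash
This opens a shell on a compute node with 1 CPU, 2 GB of RAM, and a 30-minute limit; type exit to end the session.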
Info
Interactive jobs are not suitable for long-running executions; prepare and submit such jobs non-interactively via batch scripts.
Partitions
Clusters usually have a series of predefined groups of nodes known as partitions. To see the list of partitions, use the sinfo command:
u111111@login1:~> sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
allnodes up infinite 8/18/19/45 cn-2-[1-9],cn-11-[1-8],cn-12-[1-8],cn-13-[1-9],en-1-[1-4],en-7-[1-2,5-9]
short up 30:00 0/1/0/1 en-7-5
gpu up 10-00:00:0 0/2/0/2 en-7-[1-2]
amd128 up 30-00:00:0 4/3/1/8 en-1-[1-4],en-7-[6-9]
amd48* up 60-00:00:0 4/12/9/25 cn-11-[1-8],cn-12-[1-8],cn-13-[1-9]
Info
Partitions marked with * are the default ones.
To see the details of a specific partition, use this command:
u111111@login1:~> scontrol show partition amd128
PartitionName=amd128
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=2 MaxTime=30-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=en-1-[1-4],en-7-[6-9]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=512 TotalNodes=8 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=819200
TRES=cpu=512,mem=8255024M,node=8,billing=334006
TRESBillingWeights=CPU=180,Mem=30G
Here is a brief introduction to each partition:
- short: has a time limit of 30 minutes and is connected to the Internet. If your program requires Internet access (for example, PyTorch) and you want to install a package for it, you must first create an interactive job on this partition and download and install your package there (see the sketch after the Info box below).
- gpu: useful for jobs that need to use our graphics cards. There are currently two nodes in this partition.
- amd128: contains our EPYC systems, which have 64 real cores (or 128 threads) and 1 TB of memory.
- amd48: contains our older 48-core nodes with 96 GB of memory.
Info
On Panthera, as much software as possible is installed on both the amd48 and amd128 partitions, but some software, such as Gaussian16, is only installed on the amd128 partition.
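For example, to install a Python package that must be downloaded from the Internet, the workflow might look like this (a sketch; the prompts and package name are illustrative):
u111111@login1:~> srun -p short --time=30 --mem=2G --pty bash
u111111@en-7-5:~> pip install --user some_package
u111111@en-7-5:~> exit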
CPU types
Partition | CPU type
--- | ---
short | thread
gpu | thread
amd128 | real
amd48 | real
Warning
Each partition has a time limit, and every job you submit must respect it. Your program should finish within that time frame; if it takes longer and is interrupted, it should be able to continue its work after re-execution (for example, by restarting from a checkpoint).
Submitting jobs
Before submitting a job, you should pay attention to these points:
- Linux commands are case sensitive.
- Do not use special characters like ({[&#$@ when naming files and directories.
- Put your input files in the wrkdir directory and run the command from that location (relative path).
- If you have written your input file in Windows, be sure to run the dos2unix command on it after uploading it to the cluster:
u111111@login1:wrkdir> dos2unix my_input_file
- If you use the Windows platform to write your files, prefer an editor such as Notepad++.
- When preparing your job script, you must define the resources you need (number of CPUs, RAM capacity, ...) according to the Slurm instructions (please refer to these examples); then load the module related to your software, and write the execution command of your program at the end.
- If your program produces a lot of scratch files, try moving your input files to the /tmp folder: first copy the files to /tmp, run the execution commands there, and then copy the results back to your wrkdir directory. (A sketch of a job script, including this /tmp pattern, is shown below.)
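As a minimal sketch of a job script (the module name and program are placeholders; see the sample files for real cases):
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=amd48
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00

module load my_software/1.0    # hypothetical module name
my_program my_input            # replace with your program's run command
And if your program produces many scratch files, the run section could stage through /tmp instead (again, just a sketch):
# copy inputs to node-local scratch, run there, copy results back
cp my_input /tmp/
cd /tmp
my_program my_input
cp my_output "$SLURM_SUBMIT_DIR"/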
Once your script is ready, submit your job using the sbatch command:
u111111@login1:wrkdir> sbatch your_script
You can check the status of your submitted jobs with the sq command. If the output does not show anything, it means that your job is finished (successfully or failed):
u111111@login1:wrkdir> sq
JOBID USER NAME ST NODE CPUS MEMORY (SUBMIT_)TIME TIME_LEFT NODELIST(REASON)
43807 u111183 P1-methylr R 1 8 16G 9-05:52:24 2-18:07:36 cn-11-1
43904 u111142 Co-CO2-opt R 1 16 33024M 6-04:06:39 19:53:21 cn-11-2
46884 u111142 Ni-H2-CHOH R 1 16 33024M 2-01:55:35 22:04:25 cn-11-4
46948 u111154 wna61 R 1 16 11G 1-03:47:12 8-20:12:48 en-7-8
46949 u111154 wnah5 R 1 16 11G 1-03:46:44 1-20:13:16 en-7-9
46971 u111183 P2-methylr R 1 16 6G 21:59:29 11-02:00:31 cn-11-2
47657 u111176 comp R 1 16 50G 9:37:12 1-14:22:48 en-7-6
47665 u111132 Pd13_ffreq R 1 16 6G 5:44:27 1-18:15:33 en-7-8
47686 u111186 GA_Procedu R 1 64 100G 1:38:31 1-22:21:29 en-1-1
47691 u111175 SAYDCon4_R R 1 4 4G 43:48 2-01:16:12 en-7-6
47692 u111142 Fe-ts-nics R 1 16 33024M 9:05 23:50:55 cn-11-1
47693 u111175 SAYDCon01_ R 1 4 4G 1:39 2-02:58:21 en-7-6
42269 u111125 freqdft33f PD 1 16 5G 2024-03-12T1 7-00:00:00 (AssocGrpBillingMinutes)
42270 u111125 freqdft111 PD 1 16 5G 2024-03-12T1 7-00:00:00 (AssocGrpBillingMinutes)
If your job is in PD (pending) status and you see AssocGrpBillingMinutes in the last column of the sq output, it means that your credit is low and you need to top up your account. The job will remain in this status until your credit is secured.
Important
If you see an Invalid value in the TIME_LEFT column, it means that the time you specified for your job has expired, and your job may be terminated after some extra time. To solve this problem, you can use the update-job-time command to increase the execution time of your program.
Currently, in addition to the default time limit, each of our partitions also has an OverTimeLimit value. For example, in the short partition the maximum time is 30 minutes, and your job can continue for up to 10 minutes after that; when the overtime is reached, the job is terminated.
The OverTimeLimit value in the other partitions is one day. This means that after the job's time limit is over, the job may continue for another day; once that day is over, the job is stopped.
Info
The overtime value may change according to NHPCC policies. To view the current value, use the scontrol show partition command shown above.
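For example, to check the current overtime value of the short partition:
u111111@login1:~> scontrol show partition short | grep OverTimeLimit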
If you have several jobs running in different paths, the cdw <jobID> command takes you to the path from which that job was submitted; running cdw without any options takes you to the path of the last job you submitted.
When your job is running locally, meaning you have moved your files to /tmp and the job is running from that location, the cdtmp <JobID> command takes you to the node on which your job is running. To return to the previous path, use the exit command.
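For example, with a hypothetical job ID:
# go to the directory job 47657 was submitted from
$ cdw 47657
# open a shell in /tmp on the node where job 47657 is running
$ cdtmp 47657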
Using GPUs
- If your program needs a graphics card to run, you must specify the number of cards with the -G option when running an interactive job.
- If you have written a job file, specify how many GPUs your job needs with the --gres option. Please refer to the sample files (and the sketch below).
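As a sketch of both cases:
# interactive job with one GPU
u111111@login1:~> srun -p gpu -G 1 --time=60 --pty bash

# in a batch script, request one GPU with the --gres option
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1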
Canceling jobs
To stop one or more jobs, use the scancel JobID command:
u111111@login1:wrkdir> scancel 42270,42281
If you want to cancel all your jobs, enter:
u111111@login1:wrkdir> scancel -u $USER
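You can also restrict the cancellation, for example to your pending jobs only:
u111111@login1:wrkdir> scancel -u $USER --state=pending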
Job Script Builder
For most of the software installed on Panthera, there is a pre-written script that takes your requested resources, creates a job file based on them, and submits it. These scripts usually start with the sub prefix. The guide for each script is explained in the corresponding software section. If you run these scripts with the -no option, only the job file is created, which you can then edit and submit as you wish.
Job arrays
Sometimes we want to run a program with different inputs; in other words, we have a large number of runs that can be executed separately and independently of each other. Slurm provides the concept of job arrays for this purpose, enabled with the --array option either in the job script file or as one of the options of the sbatch command.
Array indices are non-negative integers and are usually defined in one of three ways:
- as a range of numbers, by specifying the beginning and the end
- as a comma-separated list
- as a range with a specified step size
Example
# Submitting a job array with index values between 0 and 31
$ sbatch --array=0-31 -N1 job
# Submitting a job array with index values of 1, 3, 5 and 7
$ sbatch --array=1,3,5,7 -N1 job
# Submitting a job array with index values between 1 and 7 with a step size of 2 i.e. 1, 3, 5 and 7
$ sbatch --array=1-7:2 -N1 job
The SLURM_ARRAY_TASK_ID variable, which holds the index value of each array element, is the main variable in an array job; the program's inputs must be defined in terms of this variable.
Example 1
Let's suppose we have the file hello.sh:
u111112@login1:~/array_job> cat hello.sh
#!/bin/bash
#SBATCH --job-name=Hello
#SBATCH --output=%x_%A_%a.out
#SBATCH --array=0-5
#SBATCH --time=00:15:00
#SBATCH --mem=200
# You may put the commands below:
# Job step
srun echo "I am array task number" $SLURM_ARRAY_TASK_ID
As you can see below, sending this file as a job creates and executes six jobs:
u111112@login1:~/array_job> sbatch hello.sh
Submitted batch job 47668
u111112@login1:~/array_job> sq
JOBID PARTITION NAME ST NODE CPUS MEMORY (SUBMIT_)TIME TIME_LEFT NODELIST(REASON)
47668_0 amd48 Hello R 1 1 200M 18:17:01 15:00 cn-11-1
47668_1 amd48 Hello R 1 1 200M 18:17:01 15:00 cn-11-1
47668_2 amd48 Hello R 1 1 200M 18:17:01 15:00 cn-11-1
47668_3 amd48 Hello R 1 1 200M 18:17:01 15:00 cn-11-1
47668_4 amd48 Hello R 1 1 200M 18:17:01 15:00 cn-11-1
Slurm commands can be applied to one or more indices, or to the entire job array, for example:
Example
# Cancel all elements
$ scancel 47668
# Cancel array ID 4 and 5
$ scancel 47668_4 47668_5
# Cancel array ID 0 to 3
$ scancel 47668_[0-3]
Altogether, this job creates six output files:
u111112@login1:~/array_job> ls
Hello_47668_0.out Hello_47668_2.out Hello_47668_4.out config.txt hello.sh job2 pi.py test.py
Hello_47668_1.out Hello_47668_3.out Hello_47668_5.out g09 job1 output.txt test.R
u111112@login1:~/array_job> cat Hello_47668_5.out
I am array task number 5
Note
If the --output filename does not contain a % placeholder and --open-mode=append is added, only one output file will be created:
u111112@login1:~/array_job> cat hello.sh
#!/bin/bash
#SBATCH --job-name=Hello
#SBATCH --output=output.txt --open-mode=append
#SBATCH --array=0-5
#SBATCH --time=00:15:00
#SBATCH --mem=200
# You may put the commands below:
# Job step
srun echo "I am array task number" $SLURM_ARRAY_TASK_ID
u111112@login1:~/array_job> sbatch hello.sh
Submitted batch job 47675
u111112@login1:~/array_job> cat output.txt
I am array task number 1
I am array task number 0
I am array task number 2
I am array task number 3
I am array task number 4
I am array task number 5
Example 2
Now we want to run a program with multiple inputs. In this case, the input file names should be defined based on the array variable.
#!/bin/bash
#SBATCH -J Gaussian
#SBATCH -o g09.out --open-mode=append
#SBATCH -n 1
#SBATCH -c 4
#SBATCH -a 1-20
#SBATCH --mem=8G
#SBATCH --time=30
module load gaussian/g09D1
### Input files are named test01.com, test02.com, ..., test20.com
### Zero pad the task ID to match the numbering of the input files
n=$(printf "%02d" $SLURM_ARRAY_TASK_ID)
# Run Gaussian
g09 test${n}.com
Example 3
If the input filenames are not related to the array index, such as:
u111112@login1:~/array_job/g09> ls
Co-CO2-opt.gjf Co-H2-CHOHO-TS.gjf Fe-CH2OOH-opt.gjf H2-CO2-Co-ts.gjf Ni-H-CO2-opt-freq.gjf Ni-H2-CHOHO-opt.gjf config.txt
Co-H-OCHO-ts.gjf Co-OH2CO-opt.gjf Fe-ts-nics-opt.gjf Ni-H-CO2-opt-2.gjf Ni-H-CO2-opt.gjf b3lyp-nics0.gjf g09.sh
u111112@login1:~/array_job/g09>
u111112@login1:~/array_job/g09> ls *.gjf > config.txt
u111112@login1:~/array_job/g09> cat config.txt
Co-CO2-opt.gjf
Co-H-OCHO-ts.gjf
Co-H2-CHOHO-TS.gjf
Co-OH2CO-opt.gjf
Fe-CH2OOH-opt.gjf
Fe-ts-nics-opt.gjf
H2-CO2-Co-ts.gjf
Ni-H-CO2-opt-2.gjf
Ni-H-CO2-opt-freq.gjf
Ni-H-CO2-opt.gjf
Ni-H2-CHOHO-opt.gjf
b3lyp-nics0.gjf
Then, in the job script, each task reads its input file from the corresponding line of config.txt:
#SBATCH -J Gaussian
.
.
.
n=$SLURM_ARRAY_TASK_ID
# pick the n-th line of config.txt as this task's input file
input=$(sed -n "$n p" config.txt)
# Run Gaussian
g09 $input
A config file can also hold several parameters per task, for example:
u111112@login1:~/array_job> cat config.txt
ArrayTaskID SampleName Age
1 Bobby 12
2 Ben 20
3 Amelia 35
4 George 18
5 Arthur 50
6 Betty 70
7 Julia 63
8 Fred 85
9 Steve 10
10 Emily 43
With the awk command, you can extract the desired parameters from the config.txt file according to the array index:
# Specify the path to the config file
config=/path/to/config.txt
# Extract the sample name
name=$(awk -v id=$SLURM_ARRAY_TASK_ID '$1==id {print $2}' $config)
# Extract the age
age=$(awk -v id=$SLURM_ARRAY_TASK_ID '$1==id {print $3}' $config)
# Run program
foo -a $name -b $age
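Assuming the script above is saved as job.sh (a hypothetical name), you would submit it with an array range matching the ten data rows of config.txt:
$ sbatch --array=1-10 job.sh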
Monitoring jobs
There are two ways to watch the status of your jobs:
- using the my_jobs command:
JobID Partition NNo Ncpu ReqMem Elapsed State JobName
------- ---------- --- ---- ---------- ------------ --------- ---------------
31866 short 1 1 5G 00:00:47 CANCELLE+ sys/dashboard/+
31867 amd48 1 1 1G 00:02:02 CANCELLE+ sys/dashboard/+
31959 short 1 1 6G 00:15:36 CANCELLE+ sys/dashboard/+
31960 short 1 1 16G 00:27:29 COMPLETED bash
31961 amd48 1 1 16G 00:04:11 FAILED bash
31962 short 1 1 16G 00:02:07 COMPLETED bash
31968 short 1 1 6G 00:40:25 TIMEOUT sys/dashboard/+
42069 short 1 1 16G 00:15:52 COMPLETED bash
42070 amd48 1 1 16G 00:17:04 COMPLETED bash
42071 short 1 1 4G 00:01:20 CANCELLE+ sys/dashboard/+
42075 amd128 1 1 6G 00:04:03 CANCELLE+ sys/dashboard/+
42076 amd128 1 1 16G 00:00:40 FAILED bash
42079 amd48 1 1 5G 00:06:54 CANCELLE+ sys/dashboard/+
42080 gpu 1 1 2G 00:17:53 CANCELLE+ sys/dashboard/+
42084 short 1 1 4G 00:40:11 TIMEOUT sys/dashboard/+
42085 amd48 1 1 3G 01:40:45 CANCELLE+ sys/dashboard/+
42087 short 1 1 8G 00:40:27 TIMEOUT sys/dashboard/+
42089 amd128 1 1 16G 01:38:28 COMPLETED bash
42090 amd128 1 1 5G 00:02:09 CANCELLE+ sys/dashboard/+
- watching the file.JobID created under your home directory live (e.g. with tail -f):
u111111@login1:~> ls
1.wbpj Desktop Downloads Music Public Templates Videos jogl.ex.55244 my.sh simulink.err tests
1_files Documents JobSummary Pictures R Untitled1.ipynb bin mattt.log ondemand slprj wrkdir
u111111@login1:~> tail -f jogl.ex.55244
com.jogamp.opengl.GLException: Profile GL_DEFAULT is not available on X11GraphicsDevice[type .x11, connection :1, unitID 0, handle 0x0, owner false, ResourceToolkitLock[obj 0x63a6d25d, isOwner false, <7f18168b, 137a0cab>[count 0, qsz 0, owner <NULL>]]], but: []
at com.jogamp.opengl.GLProfile.get(GLProfile.java:990)
at com.jogamp.opengl.GLProfile.getDefault(GLProfile.java:721)
at com.jogamp.opengl.GLCapabilities.<init>(GLCapabilities.java:84)
at com.mathworks.hg.uij.OpenGLUtils$MyGLListener.getGLInformation(OpenGLUtils.java:332)
at com.mathworks.hg.uij.OpenGLUtils$MyGLListener.getGLData(OpenGLUtils.java:512)
at com.mathworks.hg.uij.OpenGLUtils.getGLData(OpenGLUtils.java:79)
Job statistics
After a job finishes, in addition to the output file created next to your input files, another file called JobID.out is created under the /home/your_username/JobSummary directory, which shows the statistics of your job.
u111111@login1:~> cat JobSummary/16858.out
Job ID = 16858
State = CANCELLED (exit code 0)
Cores = 1
CPU Utilized = 00:00:10
CPU Efficiency = 1.69% of 00:09:50 core-walltime
Job Wall-clock time = 00:09:50
Memory Utilized = 288.21 MB
Memory Efficiency = 14.07% of 2.00 GB
WorkDir = /home/u111111/ondemand/data/sys/dashboard/batch_connect/sys/iut-ood-jupyter/output/ea3df8dd-0e19-47fd-9002-d6bde52fab71
SubmitTime = 2024-01-14T09:13:52
StartTime = 2024-01-14T09:13:52
EndTime = 2024-01-14T09:23:42
ElapsedTime = 00:09:50
CPUTime = 00:09:50
NodeList = master-tyan
Partition = short
AllocTRES = cpu=1,mem=2048M,node=1
Billing = 29 Toman
As you can see, important information like the following is recorded for the job:
- Where was it executed from?
- When was it submitted?
- When did it start running?
- How long did it take?
- How much RAM was used?
- What percentage of the CPU was used?
- How much did the job cost in total (the amount deducted from your credit)?
Tip
By reviewing these files and examining the resource consumption of each job, you can estimate the optimal amount of resources to request for your next jobs.