Using the cluster
Panthera overview
Since the cluster is shared among all users, it needs both a resource manager and a queuing system; on our cluster this role is filled by the Slurm program.
Resource allocation in a cluster is based on a queuing mechanism: if free resources are available, your job runs immediately; otherwise it waits in the queue until resources become free.
Tip
The base operating system of our cluster, like that of most clusters, is Linux. As a rule, the software you want to work with should have a Linux version so that you can run it on the cluster.
Info
For more information about the Panthera architecture and its hardware specifications, please visit this page.
Job types on Panthera
Before describing the types of jobs, it is worth saying what kinds of workloads a cluster is useful for:
- Software written to run in parallel, so that it can use several cores of a single compute node or even multiple nodes (depending on the size of the calculation) to speed up your computations.
- Workloads that need more resources than your personal system provides.
- Serial programs that must be executed many times (with various inputs); in this case you can submit several jobs to the cluster and run them at the same time.
Generally, there are two strategies for running programs in parallel:
- shared memory
- distributed memory
The best-known implementations of these models are OpenMP (shared memory) and MPI (distributed memory). With OpenMP, your program can use at most the cores of a single system and cannot use the processors of several systems at the same time. In the distributed memory model, your program can engage the processors of several nodes simultaneously.
Note
Using more cores does not necessarily mean faster program execution.
Every parallel program has an optimal core count: up to a certain point you see a near-linear speed increase, but beyond it the gain flattens out, and adding even more cores may actually make your runs slower.
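This behavior is commonly described by Amdahl's law: if a fraction p of a program can be parallelized, the maximum speedup on n cores is 1 / ((1 − p) + p/n). For example, with p = 0.95, 64 cores give a speedup of at most about 15x, and no number of cores can push it beyond 20x.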
In general, jobs are executed in two ways on the cluster:
- Interactive: use this type of job when you need to compile code before running your program, download libraries from the Internet, or do an initial test of your code.
- Non-interactive (batch)
It is important to mention that the resources you request for an interactive job must be available at that very moment; since cluster resources are limited, interactive jobs are only granted a small number of resources for a short time.
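As a minimal sketch, one common way to start an interactive job is the srun command with the --pty option (adjust the resources and time limit to your needs):
u111111@login1:~> srun -c 1 --mem=2G --time=30 --pty bash
This opens a shell on a compute node with 1 CPU, 2 GB of RAM, and a 30-minute limit; type exit to end the session.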
Info
Interactive jobs are not suitable for long-running executions; prepare and submit such jobs non-interactively via batch scripts.
Partitions
Clusters usually have a series of predefined groups of nodes known as partitions. To see the list of partitions, use the sinfo command:
u111111@login1:~> sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
allnodes up infinite 8/18/19/45 cn-2-[1-9],cn-11-[1-8],cn-12-[1-8],cn-13-[1-9],en-1-[1-4],en-7-[1-2,5-9]
short up 30:00 0/1/0/1 en-7-5
gpu up 10-00:00:0 0/2/0/2 en-7-[1-2]
amd128 up 30-00:00:0 4/3/1/8 en-1-[1-4],en-7-[6-9]
amd48* up 60-00:00:0 4/12/9/25 cn-11-[1-8],cn-12-[1-8],cn-13-[1-9]
Info
Partitions marked with * are the default ones.
To see the details of a specific partition, use this command:
u111111@login1:~> scontrol show partition amd128
PartitionName=amd128
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=2 MaxTime=30-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=en-1-[1-4],en-7-[6-9]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=512 TotalNodes=8 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=819200
TRES=cpu=512,mem=8255024M,node=8,billing=334006
TRESBillingWeights=CPU=180,Mem=30G
Here is a brief introduction to each partition:
- short: has a time limit of 30 minutes and is connected to the Internet. If your program requires Internet access (for example, PyTorch) and you want to install a package for it, you must first create an interactive job on this partition and download and install your package there (see the sketch after the Info box below).
- gpu: useful for jobs that need to use our graphics cards. There are currently two nodes in this partition.
- amd128: contains our EPYC systems, which have 64 real cores (or 128 threads) and 1 TB of memory.
- amd48: contains our older 48-core nodes with 96 GB of memory.
Info
On Panthera, as much software as possible is installed on both the amd48 and amd128 partitions, but some software, such as Gaussian16, is only installed on the amd128 partition.
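For example, to install a Python package that must be downloaded from the Internet, the workflow might look like this (a sketch; the prompts and package name are illustrative):
u111111@login1:~> srun -p short --time=30 --mem=2G --pty bash
u111111@en-7-5:~> pip install --user some_package
u111111@en-7-5:~> exit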
CPU types
Partition | CPU type
--- | ---
short | thread
gpu | thread
amd128 | real
amd48 | real
Warning
Each partition has a time limit, and every job you submit must respect it. Your program should finish within that time frame; if it takes longer and is interrupted, it should be able to continue its work after re-execution (for example, by restarting from a checkpoint).
Submitting jobs
Before submitting a job, you should pay attention to these points:
- Linux commands are case sensitive.
- Do not use special characters like ({[&#$@ when naming files and directories.
- Put your input files in the wrkdir directory and run the command from that location (relative path).
- If you have written your input file in Windows, be sure to run the dos2unix command on it after uploading it to the cluster:
u111111@login1:wrkdir> dos2unix my_input_file
- If you use the Windows platform to write your files, prefer an editor such as Notepad++.
- When preparing your job script, you must define the resources you need (number of CPUs, RAM capacity, ...) according to the Slurm instructions (please refer to these examples); then load the module related to your software, and write the execution command of your program at the end.
- If your program produces a lot of scratch files, try moving your input files to the /tmp folder: first copy the files to /tmp, run the execution commands there, and then copy the results back to your wrkdir directory. (A sketch of a job script, including this /tmp pattern, is shown below.)
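As a minimal sketch of a job script (the module name and program are placeholders; see the sample files for real cases):
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=amd48
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00

module load my_software/1.0    # hypothetical module name
my_program my_input            # replace with your program's run command
And if your program produces many scratch files, the run section could stage through /tmp instead (again, just a sketch):
# copy inputs to node-local scratch, run there, copy results back
cp my_input /tmp/
cd /tmp
my_program my_input
cp my_output "$SLURM_SUBMIT_DIR"/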
Once your script is ready, submit your job using the sbatch command:
u111111@login1:wrkdir> sbatch your_script
You can check the status of your submitted jobs with the sq command. If the output does not show anything, it means that your job is finished (successfully or failed):
u111111@login1:wrkdir> sq
JOBID USER NAME ST NODE CPUS MEMORY (SUBMIT_)TIME TIME_LEFT NODELIST(REASON)
43807 u111183 P1-methylr R 1 8 16G 9-05:52:24 2-18:07:36 cn-11-1
43904 u111142 Co-CO2-opt R 1 16 33024M 6-04:06:39 19:53:21 cn-11-2
46884 u111142 Ni-H2-CHOH R 1 16 33024M 2-01:55:35 22:04:25 cn-11-4
46948 u111154 wna61 R 1 16 11G 1-03:47:12 8-20:12:48 en-7-8
46949 u111154 wnah5 R 1 16 11G 1-03:46:44 1-20:13:16 en-7-9
46971 u111183 P2-methylr R 1 16 6G 21:59:29 11-02:00:31 cn-11-2
47657 u111176 comp R 1 16 50G 9:37:12 1-14:22:48 en-7-6
47665 u111132 Pd13_ffreq R 1 16 6G 5:44:27 1-18:15:33 en-7-8
47686 u111186 GA_Procedu R 1 64 100G 1:38:31 1-22:21:29 en-1-1
47691 u111175 SAYDCon4_R R 1 4 4G 43:48 2-01:16:12 en-7-6
47692 u111142 Fe-ts-nics R 1 16 33024M 9:05 23:50:55 cn-11-1
47693 u111175 SAYDCon01_ R 1 4 4G 1:39 2-02:58:21 en-7-6
42269 u111125 freqdft33f PD 1 16 5G 2024-03-12T1 7-00:00:00 (AssocGrpBillingMinutes)
42270 u111125 freqdft111 PD 1 16 5G 2024-03-12T1 7-00:00:00 (AssocGrpBillingMinutes)
If your job is in PD (pending) status and you see AssocGrpBillingMinutes in the last column of the sq output, it means that your credit is low and you need to top up your account. The job will remain in this status until your credit is secured.
Important
If you see an Invalid value in the TIME_LEFT column, it means that the time you specified for your job has expired, and your job may be terminated after some extra time. To solve this problem, you can use the update-job-time command to increase the execution time of your program.
Currently, in addition to the default time limit, each of our partitions also has an OverTimeLimit value. For example, in the short partition the maximum time is 30 minutes, and your job can continue for up to 10 minutes after that; when the overtime is reached, the job is terminated.
The OverTimeLimit value in the other partitions is one day. This means that after the job's time limit is over, the job may continue for another day; once that day is over, the job is stopped.
Info
The overtime value may change according to NHPCC policies. To view the current value, use the scontrol show partition command shown above.
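For example, to check the current overtime value of the short partition:
u111111@login1:~> scontrol show partition short | grep OverTimeLimit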
If you have several jobs running in different paths, the cdw <jobID> command takes you to the path from which that job was submitted; running cdw without any options takes you to the path of the last job you submitted.
When your job is running locally, meaning you have moved your files to /tmp and the job is running from that location, the cdtmp <JobID> command takes you to the node on which your job is running. To return to the previous path, use the exit command.
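For example, with a hypothetical job ID:
# go to the directory job 47657 was submitted from
$ cdw 47657
# open a shell in /tmp on the node where job 47657 is running
$ cdtmp 47657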
Using GPUs
- If your program needs a graphics card to run, you must specify the number of cards with the -G option when running an interactive job.
- If you have written a job file, specify how many GPUs your job needs with the --gres option. Please refer to the sample files (and the sketch below).
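As a sketch of both cases:
# interactive job with one GPU
u111111@login1:~> srun -p gpu -G 1 --time=60 --pty bash

# in a batch script, request one GPU with the --gres option
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1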
Canceling jobs
To stop one or more jobs, use the scancel JobID command:
u111111@login1:wrkdir> scancel 42270,42281
If you want to cancel all your jobs, enter:
u111111@login1:wrkdir> scancel -u $USER
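You can also restrict the cancellation, for example to your pending jobs only:
u111111@login1:wrkdir> scancel -u $USER --state=pending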
Job Script Builder
For most of the software installed on Panthera, there is a pre-written script that takes your requested resources, creates a job file based on them, and submits it. These scripts usually start with the sub prefix. The guide for each script is explained in the corresponding software section. If you run these scripts with the -no option, only the job file is created, which you can then edit and submit as you wish.
Job arrays
Sometimes we want to run a program with different inputs; in other words, we have a large number of runs that can be executed separately and independently of each other. Slurm provides the concept of job arrays for this purpose, enabled with the --array option either in the job script file or as one of the options of the sbatch command.
Array indices are non-negative integers and are usually defined in one of three ways:
- as a range of numbers, by specifying the beginning and the end
- as a comma-separated list
- as a range with a specified step size
Example
# Submitting a job array with index values between 0 and 31
$ sbatch --array=0-31 -N1 job
# Submitting a job array with index values of 1, 3, 5 and 7
$ sbatch --array=1,3,5,7 -N1 job
# Submitting a job array with index values between 1 and 7 with a step size of 2 i.e. 1, 3, 5 and 7
$ sbatch --array=1-7:2 -N1 job
The SLURM_ARRAY_TASK_ID variable, which holds the index value of each array element, is the main variable in an array job; the program's inputs must be defined in terms of this variable.
Example 1
Let's suppose we have the file hello.sh:
u111112@login1:~/array_job> cat hello.sh
#!/bin/bash
#SBATCH --job-name=Hello
#SBATCH --output=%x_%A_%a.out
#SBATCH --array=0-5
#SBATCH --time=00:15:00
#SBATCH --mem=200
# You may put the commands below:
# Job step
srun echo "I am array task number" $SLURM_ARRAY_TASK_ID
As you can see below, sending this file as a job creates and executes six jobs:
u111112@login1:~/array_job> sbatch hello.sh
Submitted batch job 47668
u111112@login1:~/array_job> sq
JOBID PARTITION NAME ST NODE CPUS MEMORY (SUBMIT_)TIME TIME_LEFT NODELIST(REASON)
47668_0 amd48 Hello R 1 1 200M 18:17:01 15:00 cn-11-1
47668_1 amd48 Hello R 1 1 200M 18:17:01 15:00 cn-11-1
47668_2 amd48 Hello R 1 1 200M 18:17:01 15:00 cn-11-1
47668_3 amd48 Hello R 1 1 200M 18:17:01 15:00 cn-11-1
47668_4 amd48 Hello R 1 1 200M 18:17:01 15:00 cn-11-1
Slurm commands can be applied to one or more indices, or to the entire job array, for example:
Example
# Cancel all elements
$ scancel 47668
# Cancel array ID 4 and 5
$ scancel 47668_4 47668_5
# Cancel array ID 0 to 3
$ scancel 47668_[0-3]
Altogether, this job creates six output files:
u111112@login1:~/array_job> ls
Hello_47668_0.out Hello_47668_2.out Hello_47668_4.out config.txt hello.sh job2 pi.py test.py
Hello_47668_1.out Hello_47668_3.out Hello_47668_5.out g09 job1 output.txt test.R
u111112@login1:~/array_job> cat Hello_47668_5.out
I am array task number 5
Note
If the --output filename does not contain a % placeholder and --open-mode=append is added, only one output file will be created:
u111112@login1:~/array_job> cat hello.sh
#!/bin/bash
#SBATCH --job-name=Hello
#SBATCH --output=output.txt --open-mode=append
#SBATCH --array=0-5
#SBATCH --time=00:15:00
#SBATCH --mem=200
# You may put the commands below:
# Job step
srun echo "I am array task number" $SLURM_ARRAY_TASK_ID
u111112@login1:~/array_job> sbatch hello.sh
Submitted batch job 47675
u111112@login1:~/array_job> cat output.txt
I am array task number 1
I am array task number 0
I am array task number 2
I am array task number 3
I am array task number 4
I am array task number 5
Example 2
Now we want to run a program with multiple inputs. In this case, the input file names should be defined based on the array variable.
#!/bin/bash
#SBATCH -J Gaussian
#SBATCH -o g09.out --open-mode=append
#SBATCH -n 1
#SBATCH -c 4
#SBATCH -a 1-20
#SBATCH --mem=8G
#SBATCH --time=30
module load gaussian/g09D1
### Input files are named test01.com, test02.com, ..., test20.com
### Zero pad the task ID to match the numbering of the input files
n=$(printf "%02d" $SLURM_ARRAY_TASK_ID)
# Run Gaussian
g09 test${n}.com
Example 3
If the input filenames are not related to the array index, such as:
u111112@login1:~/array_job/g09> ls
Co-CO2-opt.gjf Co-H2-CHOHO-TS.gjf Fe-CH2OOH-opt.gjf H2-CO2-Co-ts.gjf Ni-H-CO2-opt-freq.gjf Ni-H2-CHOHO-opt.gjf config.txt
Co-H-OCHO-ts.gjf Co-OH2CO-opt.gjf Fe-ts-nics-opt.gjf Ni-H-CO2-opt-2.gjf Ni-H-CO2-opt.gjf b3lyp-nics0.gjf g09.sh
u111112@login1:~/array_job/g09>
u111112@login1:~/array_job/g09> ls *.gjf > config.txt
u111112@login1:~/array_job/g09> cat config.txt
Co-CO2-opt.gjf
Co-H-OCHO-ts.gjf
Co-H2-CHOHO-TS.gjf
Co-OH2CO-opt.gjf
Fe-CH2OOH-opt.gjf
Fe-ts-nics-opt.gjf
H2-CO2-Co-ts.gjf
Ni-H-CO2-opt-2.gjf
Ni-H-CO2-opt-freq.gjf
Ni-H-CO2-opt.gjf
Ni-H2-CHOHO-opt.gjf
b3lyp-nics0.gjf
Then, in the job script, each task reads its input file from the corresponding line of config.txt:
#SBATCH -J Gaussian
.
.
.
n=$SLURM_ARRAY_TASK_ID
# pick the n-th line of config.txt as this task's input file
input=$(sed -n "$n p" config.txt)
# Run Gaussian
g09 $input
A config file can also hold several parameters per task, for example:
u111112@login1:~/array_job> cat config.txt
ArrayTaskID SampleName Age
1 Bobby 12
2 Ben 20
3 Amelia 35
4 George 18
5 Arthur 50
6 Betty 70
7 Julia 63
8 Fred 85
9 Steve 10
10 Emily 43
With the awk command, you can extract the desired parameters from the config.txt file according to the array index:
# Specify the path to the config file
config=/path/to/config.txt
# Extract the sample name
name=$(awk -v id=$SLURM_ARRAY_TASK_ID '$1==id {print $2}' $config)
# Extract the age
age=$(awk -v id=$SLURM_ARRAY_TASK_ID '$1==id {print $3}' $config)
# Run program
foo -a $name -b $age
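Assuming the script above is saved as job.sh (a hypothetical name), you would submit it with an array range matching the ten data rows of config.txt:
$ sbatch --array=1-10 job.sh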
Monitoring jobs
There are two ways to watch the status of your jobs:
- using the my_jobs command:
JobID Partition NNo Ncpu ReqMem Elapsed State JobName
------- ---------- --- ---- ---------- ------------ --------- ---------------
31866 short 1 1 5G 00:00:47 CANCELLE+ sys/dashboard/+
31867 amd48 1 1 1G 00:02:02 CANCELLE+ sys/dashboard/+
31959 short 1 1 6G 00:15:36 CANCELLE+ sys/dashboard/+
31960 short 1 1 16G 00:27:29 COMPLETED bash
31961 amd48 1 1 16G 00:04:11 FAILED bash
31962 short 1 1 16G 00:02:07 COMPLETED bash
31968 short 1 1 6G 00:40:25 TIMEOUT sys/dashboard/+
42069 short 1 1 16G 00:15:52 COMPLETED bash
42070 amd48 1 1 16G 00:17:04 COMPLETED bash
42071 short 1 1 4G 00:01:20 CANCELLE+ sys/dashboard/+
42075 amd128 1 1 6G 00:04:03 CANCELLE+ sys/dashboard/+
42076 amd128 1 1 16G 00:00:40 FAILED bash
42079 amd48 1 1 5G 00:06:54 CANCELLE+ sys/dashboard/+
42080 gpu 1 1 2G 00:17:53 CANCELLE+ sys/dashboard/+
42084 short 1 1 4G 00:40:11 TIMEOUT sys/dashboard/+
42085 amd48 1 1 3G 01:40:45 CANCELLE+ sys/dashboard/+
42087 short 1 1 8G 00:40:27 TIMEOUT sys/dashboard/+
42089 amd128 1 1 16G 01:38:28 COMPLETED bash
42090 amd128 1 1 5G 00:02:09 CANCELLE+ sys/dashboard/+
- watching the file.JobID created under your home directory live (e.g. with tail -f):
u111111@login1:~> ls
1.wbpj Desktop Downloads Music Public Templates Videos jogl.ex.55244 my.sh simulink.err tests
1_files Documents JobSummary Pictures R Untitled1.ipynb bin mattt.log ondemand slprj wrkdir
u111111@login1:~> tail -f jogl.ex.55244
com.jogamp.opengl.GLException: Profile GL_DEFAULT is not available on X11GraphicsDevice[type .x11, connection :1, unitID 0, handle 0x0, owner false, ResourceToolkitLock[obj 0x63a6d25d, isOwner false, <7f18168b, 137a0cab>[count 0, qsz 0, owner <NULL>]]], but: []
at com.jogamp.opengl.GLProfile.get(GLProfile.java:990)
at com.jogamp.opengl.GLProfile.getDefault(GLProfile.java:721)
at com.jogamp.opengl.GLCapabilities.<init>(GLCapabilities.java:84)
at com.mathworks.hg.uij.OpenGLUtils$MyGLListener.getGLInformation(OpenGLUtils.java:332)
at com.mathworks.hg.uij.OpenGLUtils$MyGLListener.getGLData(OpenGLUtils.java:512)
at com.mathworks.hg.uij.OpenGLUtils.getGLData(OpenGLUtils.java:79)
Job statistics
After a job finishes, in addition to the output file created next to your input files, another file called JobID.out is created under the /home/your_username/JobSummary directory, which shows the statistics of your job.
u111111@login1:~> cat JobSummary/16858.out
Job ID = 16858
State = CANCELLED (exit code 0)
Cores = 1
CPU Utilized = 00:00:10
CPU Efficiency = 1.69% of 00:09:50 core-walltime
Job Wall-clock time = 00:09:50
Memory Utilized = 288.21 MB
Memory Efficiency = 14.07% of 2.00 GB
WorkDir = /home/u111111/ondemand/data/sys/dashboard/batch_connect/sys/iut-ood-jupyter/output/ea3df8dd-0e19-47fd-9002-d6bde52fab71
SubmitTime = 2024-01-14T09:13:52
StartTime = 2024-01-14T09:13:52
EndTime = 2024-01-14T09:23:42
ElapsedTime = 00:09:50
CPUTime = 00:09:50
NodeList = master-tyan
Partition = short
AllocTRES = cpu=1,mem=2048M,node=1
Billing = 29 Toman
As you can see, important information like the following is recorded for the job:
- Where was it executed from?
- When was it submitted?
- When did it start running?
- How long did it take?
- How much RAM was used?
- What percentage of the CPU was used?
- How much did the job cost in total (the amount deducted from your credit)?
Tip
By reviewing these files and examining the resource consumption of each job, you can estimate the optimal amount of resources to request for your next jobs.