
Introduction 

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. As a cluster workload manager, Slurm has three key functions. 

  • First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. 
  • Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. 
  • Finally, it arbitrates contention for resources by managing a queue of pending work. 

Slurm Architecture

Slurm has a centralized manager, slurmctld, to monitor resources and work. Slurm consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node (with optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communications. The user commands include: sacct, sacctmgr, salloc, sattach, sbatch, sbcast, scancel, scontrol, scrontab, sdiag, sh5util, sinfo, sprio, squeue, sreport, srun, sshare, sstat, strigger and sview. All of the commands can run anywhere in the cluster.
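On a cluster where these daemons are running, you can verify them from any login node. A minimal sketch (the hostname `node001` is a placeholder; substitute a real compute node name from your cluster):

```shell
# Check that the central slurmctld controller (and its backup, if any) is responding
scontrol ping

# List compute nodes with the state reported by each node's slurmd
sinfo -N -l

# Show detailed information for one node (node001 is a placeholder hostname)
scontrol show node node001
```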

User Tools 

  • srun - initiates jobs 
  • scancel - terminates queued or running jobs 
  • sinfo - reports system status 
  • squeue - reports the status of jobs 
  • sacct - gets information about jobs and job steps that are running or have completed 
  • sview - graphically reports system and job status, including network topology 
  • scontrol - monitors and/or modifies configuration and state information on the cluster 
  • sacctmgr - the administrative tool used to manage the accounting database; it can be used to identify the clusters, valid users, valid bank accounts, etc. 
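A few common invocations of these tools, assuming you are logged in to a cluster login node (the job ID 12345 is a placeholder):

```shell
# Jobs you currently have queued or running
squeue -u $USER

# Cluster and partition status at a glance
sinfo

# Accounting records for your recent jobs
sacct -u $USER --format=JobID,JobName,Partition,State,ExitCode

# Detailed information about one specific job (12345 is a placeholder job ID)
scontrol show job 12345
```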

More information on user tools: https://slurm.schedmd.com/quickstart.html 


Slurm Partitions and Nodes

Entities managed by Slurm daemons are nodes, partitions, jobs and job steps.

Nodes - The compute resource in Slurm

Partitions - Groups nodes into logical (possibly overlapping) sets. The partitions can be considered job queues, each of which has an assortment of constraints such as job size limit, job time limit, users permitted to use it, etc. Priority-ordered jobs are allocated nodes within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted. 

Jobs - Allocations of resources assigned to a user for a specified amount of time 

Job Steps - These are sets of (possibly parallel) tasks within a job

Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. 
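The relationship between a job and its job steps can be sketched with a small batch script. This is a hedged example: the job name, resource counts, and time limit are illustrative, and available partitions and limits are site-specific.

```shell
#!/bin/bash
#SBATCH --job-name=steps-demo   # job name shown in squeue
#SBATCH --nodes=2               # request an allocation of two nodes
#SBATCH --ntasks=4              # four tasks in total across the allocation
#SBATCH --time=00:10:00         # ten-minute time limit

# Each srun invocation below launches a separate job step
# within the job's allocation.
srun --ntasks=4 hostname                  # step 0: runs on all four tasks
srun --ntasks=1 echo "single-task step"   # step 1: runs on one task
```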

More information: https://slurm.schedmd.com/overview.html 



Submit a job in Slurm 

Please go through this link, which has details on how to submit a sample job to Slurm: How do I Get Started with High Performance Computing and Create my first job? 

Delete a Slurm job

Please have the ID of the job you want to delete. If you are unsure of the job ID, use the command below to list the jobs under your user:

squeue -u <your_username>

To delete a Slurm job, use the following command:

scancel <job_id>
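Beyond cancelling a single job by ID, scancel accepts filters; a few common forms (12345 is a placeholder job ID):

```shell
# Cancel one job by ID
scancel 12345

# Cancel all of your own jobs
scancel -u $USER

# Cancel only your pending (not yet started) jobs
scancel -u $USER --state=PENDING
```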




Parallel Job Submission in Slurm

For parallel job submission in Slurm, please refer to this link: Can I run parallel jobs on the cluster? 



