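Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. As a cluster workload manager, Slurm has three key functions: it allocates access to resources (compute nodes) to users for some duration of time so they can perform work, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes, and it arbitrates contention for resources by managing a queue of pending work.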
Slurm consists of a central slurmctld daemon running on a management node (with an optional fail-over twin), which monitors resources and work, and a slurmd daemon running on each compute node. The slurmd daemons provide fault-tolerant hierarchical communications.
The user commands include: sacct, sacctmgr, salloc, sattach, sbatch, sbcast, scancel, scontrol, scrontab, sdiag, sh5util, sinfo, sprio, squeue, sreport, srun, sshare, sstat, strigger and sview. All of the commands can run anywhere in the cluster. For example:
srun to initiate jobs
scancel to terminate queued or running jobs
sinfo to report system status
squeue to report the status of jobs
sacct to get information about jobs and job steps that are running or have completed
sview to graphically report system and job status, including network topology
scontrol to monitor and/or modify configuration and state information on the cluster
sacctmgr is the administrative tool used to manage the database. It can be used to identify the clusters, valid users, valid bank accounts, etc.
More information on user tools: https://slurm.schedmd.com/quickstart.html
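As a rough sketch of how the reporting commands above are typically called, the snippet below runs sinfo, squeue and sacct in turn. The run_if_available helper is a hypothetical wrapper added for this illustration, not part of Slurm; it only skips a command when it is not installed on the current machine.

```shell
#!/bin/sh
# run_if_available is a hypothetical helper (not a Slurm command):
# it runs the given command only when that command exists on this machine.
run_if_available() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped (not installed): $*"
    fi
}

run_if_available sinfo                    # partitions and node states
run_if_available squeue -u "$(id -un)"    # your queued and running jobs
run_if_available sacct                    # accounting records for your recent jobs
```

On a cluster login node the three commands can simply be typed directly, without the wrapper.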
Slurm Partitions and Nodes
Entities managed by Slurm daemons are nodes, partitions, jobs and job steps.
Nodes - The compute resource in Slurm
Partitions - Groups nodes into logical (possibly overlapping) sets. The partitions can be considered job queues, each of which has an assortment of constraints such as job size limit, job time limit, users permitted to use it, etc. Priority-ordered jobs are allocated nodes within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted.
Jobs - Allocations of resources assigned to a user for a specified amount of time
Job Steps - Sets of (possibly parallel) tasks within a job
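To see how the partitions and nodes described above are set up on a particular cluster, something like the following can be used. This is a sketch: the column selection in PARTITION_FORMAT is one possible choice made for this example, and the output depends entirely on the site's configuration.

```shell
#!/bin/sh
# Columns: partition name, availability, time limit, maximum job size (nodes),
# node count, node state. The selection is an illustrative choice.
PARTITION_FORMAT="%P %a %l %s %D %t"

if command -v sinfo >/dev/null 2>&1; then
    sinfo -o "$PARTITION_FORMAT"    # one line per partition and node state
    sinfo -N -l                     # one line per node, long format
    scontrol show partition         # full settings for every partition
else
    echo "Slurm tools not found; run these commands on a cluster login node"
fi
```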
Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation.
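The relationship between a job and its job steps can be sketched with a small batch script. In the example below, the sbatch directives request the allocation and each srun line starts one job step inside it. The file name steps_job.sh, the job name, and the requested resources are illustrative assumptions; adjust them to your site's limits and add a partition if your site requires one.

```shell
#!/bin/sh
# Write a small batch script (file name and resource values are illustrative).
cat > steps_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=steps-demo
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --time=00:05:00
#SBATCH --output=steps-demo-%j.out

# Each srun below is one job step inside the allocation.
srun --ntasks=4 hostname          # step 0: four tasks across the allocated nodes
srun --nodes=1 --ntasks=1 date    # step 1: a single task on one node
EOF

# Submit the script only where sbatch is installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch steps_job.sh
else
    echo "sbatch not found; copy steps_job.sh to a cluster login node to submit it"
fi
```

Once the job has run, sacct -j <job_id> should list the job together with its individual steps.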
For details on how to submit a sample job to Slurm, see the article "How do I Get Started with High Performance Computing and Create my first job?"
Delete a Slurm Job
You need the ID of the job you want to delete. If you are unsure of the job ID, use the command below to list the jobs under your user:
squeue -u <your_username>
If you need to delete a Slurm job for any reason, pass its job ID to scancel:
scancel <job_id>
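A minimal sketch of cancelling jobs with scancel, the command listed earlier for terminating queued or running jobs. The cancel_job function is a hypothetical wrapper written for this example, not a Slurm command; it refuses an empty job ID and only prints the command when scancel is not installed.

```shell
#!/bin/sh
# cancel_job is a hypothetical helper, not part of Slurm.
cancel_job() {
    if [ -z "$1" ]; then
        echo "usage: cancel_job <job_id>" >&2
        return 2
    fi
    if command -v scancel >/dev/null 2>&1; then
        scancel "$1"                      # cancel the job with this ID
    else
        echo "would run: scancel $1"      # no Slurm here; show the command only
    fi
}

# Example call (replace the placeholder with an ID reported by squeue):
#   cancel_job <job_id>
#
# Related forms of scancel:
#   scancel -u <your_username>                    cancel all of your jobs
#   scancel --state=PENDING -u <your_username>    cancel only your pending jobs
```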
Parallel Job Submission in Slurm
For parallel job submission in Slurm, see the article "Can I run parallel jobs on the cluster?"