Slurm Source Code Install | Cluster Deployment - Day 4: Methods of Restricting Resources in Slurm
- Support for Multi-core/Multi-thread Architectures (using srun to control resource usage)
- Consumable Resources in Slurm
- Resource Binding
- CPU Management User and Administrator Guide
- GRES
- Heterogeneous Job Support
- Containers Guide
Support for Multi-core/Multi-thread Architectures (using srun to control resource usage)
Some key concepts behind the srun parameters:
- BaseBoard
- LDom
- Socket/Core/Thread
- CPU
- Affinity/Affinity Mask/Fat Masks
The user must pass srun options such as
--cpu-bind=... --sockets-per-node=S --cores-per-socket=C --threads-per-core=T
and other advanced options to specify how the job's tasks are bound to the CPUs of its allocated nodes, as in the sketch below.
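A minimal sketch, assuming a node with 2 sockets of 4 single-threaded cores each (the topology values and the ./app payload are assumptions; match them to your hardware):

srun -N 1 -n 8 --sockets-per-node=2 --cores-per-socket=4 --threads-per-core=1 --cpu-bind=cores ./app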
Testing this plugin in Slurm, --cpu-bind accepts:
- map_cpu:[list] specify a CPU ID binding for each task where [list] is [cpuid1],[cpuid2],...[cpuidN]
The CPU IDs within a node, in block numbering, can be found in the /proc/cpuinfo file on the system.
- map_ldom:[list] specify a NUMA locality domain ID binding for each task where [list] is [ldom1],[ldom2],...[ldomN]
- The remaining entries in the --cpu-bind help output are:
- boards - auto-generated masks bind to boards
- ldoms - auto-generated masks bind to NUMA locality domains
- sockets - auto-generated masks bind to sockets
- cores - auto-generated masks bind to cores
- threads - auto-generated masks bind to threads
- help - show this help message

As we can see, the six options above rely on auto-generated masks. When we want to bind tasks to specific CPU IDs of our own choosing, map_cpu is the option to use, as in the sketch below.
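A minimal sketch of explicit binding, assuming CPU IDs 0 and 4 sit on different cores (check /proc/cpuinfo for the real layout; ./app is a placeholder):

srun -N 1 -n 2 --cpu-bind=map_cpu:0,4 ./app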
Consumable Resources in Slurm
Slurm, using the default node allocation plug-in, allocates nodes to jobs in exclusive mode.
Default: exclusive node allocation. The alternative is the consumable resources plugin, enabled with SelectType=select/cons_res (a configuration sketch follows the CR_* list below).
Consumable resources have been enhanced with several new resource types, namely CPU (the same as in previous versions), Socket, Core, and Memory, as well as any combination of the logical processors with Memory:
CPU (CR_CPU): CPU as a consumable resource.
- No notion of sockets, cores, or threads.
- On a multi-core system CPUs will be cores.
- On a multi-core/hyperthread system CPUs will be threads.
- On single-core systems CPUs are CPUs. ;-)
Board (CR_Board): Baseboard as a consumable resource.
Socket (CR_Socket): Socket as a consumable resource.
Core (CR_Core): Core as a consumable resource.
Memory (CR_Memory): Memory only as a consumable resource. Note: CR_Memory assumes OverSubscribe=Yes
Socket and Memory (CR_Socket_Memory): Socket and Memory as consumable resources.
Core and Memory (CR_Core_Memory): Core and Memory as consumable resources.
CPU and Memory (CR_CPU_Memory): CPU and Memory as consumable resources.
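A minimal slurm.conf sketch enabling one of the combinations above (CR_Core_Memory is just an example choice):

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory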
Example: srun -N 5 -n 20 --mem=10 sleep 100 &   (a running job requesting 5 nodes, 20 tasks, and 10 MB of memory per node)
On many systems with large processor counts, jobs typically run one fewer task than there are processors, to minimize interference from the kernel and daemons.
Resource Binding
CPU binding is resolved in priority order:
- Highest priority: the binding specified with the srun --cpu-bind option.
- Next: the node-specific binding, if any node in the job allocation has a CpuBind configuration parameter and all other nodes in the allocation either have the same CpuBind value or none at all.
- Next: the partition-specific CpuBind configuration parameter (if any).
- Lowest priority: the binding specified by the TaskPluginParam configuration parameter.
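A hedged slurm.conf sketch of where the node- and partition-level bindings come from (the node name, CPU count, and partition name are assumptions):

NodeName=node01 CPUs=16 CpuBind=cores
PartitionName=debug Nodes=node01 CpuBind=threads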
Cgroups
- Cgroup - a container for a set of processes;
- Subsystem - a module, typically a resource controller, that applies a set of parameters to the cgroups in a hierarchy.
- Hierarchy - a set of cgroups organized in a tree structure, with one or more associated subsystems.
- State Objects - pseudofiles that represent the state of a cgroup or apply controls to a cgroup:
- tasks - identifies the processes (PIDs) in the cgroup
- additional state objects specific to each subsystem
Cgroup functionality in Slurm:
- The ability to confine jobs and steps to their allocated cpuset.
- The ability to bind tasks to sockets, cores and threads within their step's allocated cpuset on a node.
- Supports block and cyclic distribution of allocated cpus to tasks for binding.
- The ability to confine jobs and steps to specific memory resources.
- The ability to confine jobs to their allocated set of generic resources (gres devices).
Note that all these structures apply to a specific compute node. Jobs that use more than one node will have a cgroup structure on each node.
It seems that we need to edit cgroup.conf ourselves to restrict users' resource usage; a sketch follows.
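A minimal cgroup.conf sketch that confines jobs to their allocated cores, memory, and GRES devices (TaskPlugin=task/cgroup must also be set in slurm.conf for the cpuset confinement to apply):

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes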
CPU Management User and Administrator Guide
GRES
Generic resource (GRES) scheduling is supported through a flexible plugin mechanism. Support is currently provided for Graphics Processing Units (GPUs), CUDA Multi-Process Service (MPS), and Intel® Many Integrated Core (MIC) processors.
We can use gres.conf together with slurm.conf to specify the GPUs (and other GRES) available on each node.
GresTypes=gpu,mps,bandwidth   (in slurm.conf)
AutoDetect=nvml   (in gres.conf)
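A hedged sketch of matching entries (the node names, CPU count, GPU count, and device paths are assumptions for illustration):

In slurm.conf:
GresTypes=gpu
NodeName=node[01-02] CPUs=16 Gres=gpu:4

In gres.conf:
NodeName=node[01-02] Name=gpu File=/dev/nvidia[0-3]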
Slurm supports GPU scheduling.
Jobs will not be allocated any generic resources unless specifically requested at job submit time using the options:
- --gres: Generic resources required per node
- --gpus: GPUs required per job
- --gpus-per-node: GPUs required per node. Equivalent to the --gres option for GPUs.
- --gpus-per-socket: GPUs required per socket. Requires the job to specify a task socket.
- --gpus-per-task: GPUs required per task. Requires the job to specify a task count.
In summary, we can define gpu, mps, mic, and bandwidth GRES types, and request them with --gres, --gpus, --gpus-per-node, --gpus-per-socket, and --gpus-per-task.
GRES Scheduling
We can request these GRES at submit time with the options listed above; a minimal sketch follows.
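A hedged request sketch (the GPU count and the nvidia-smi payload are illustrations):

srun -N 1 --gpus-per-node=2 nvidia-smi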
Heterogeneous Job Support
For example, part of a job might require four cores and 4 GB for each of 128 tasks while another part of the job would require 16 GB of memory and one CPU.
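A hedged sketch of submitting that example as a heterogeneous job, using srun's ":" separator between components (./simulate and ./analyze are placeholder programs; 4 cores at 1 GB per CPU gives each of the 128 tasks its 4 GB):

srun -n 128 -c 4 --mem-per-cpu=1G ./simulate : -n 1 --mem=16G ./analyze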
Containers Guide
- DOCKER - The issue that usually stops most sites from using Docker is the requirement that "only trusted users should be allowed to control your Docker daemon" [Docker Security], which is not acceptable to most HPC systems.
Sites with trusted users can add them to the docker Unix group and allow them to control Docker directly from inside of jobs. There is currently no direct support for starting or stopping docker containers in Slurm.
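As a sketch, a site could grant a trusted user direct Docker control like this (alice is a placeholder username):

sudo usermod -aG docker alice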
- Charliecloud - a user namespace container system sponsored by LANL to provide HPC containers. Charliecloud supports:
- direct invocation by users via user namespace support
- direct Slurm support (currently in development)
- OCI image support (via a wrapper)
- Shifter - a container project out of NERSC to provide HPC containers with full scheduler integration.
- Singularity - a hybrid container system that supports both privileged (setuid) and unprivileged (user namespace) execution.
- ENROOT - a user-space container utility from NVIDIA.
We can see that Slurm lets us specify the number of GPUs a job uses, and that we can restrict a task's resources through the command line, with the exception of memory and NIC, which are constrained through configuration instead.