3 Slurm Integration

From an HPC perspective, it is assumed that end users prefer to think of computing in terms of complete machines rather than being responsible for composing one. In particular, most HPC clusters use a job scheduler to manage multiple users and cluster resources. From this perspective, users submit jobs that request resources. These resources can be cluster nodes, processor cores, memory, storage, GPU nodes (accelerators), software licenses, etc.

Of particular interest is the use of multi-GPU machines for application acceleration. One possible approach, explored here as a Proof-of-Concept (PoC), is the use of "alias machines" that are composed at run-time by the scheduler. Thus, users are presented with "existing" machine options rather than configuring a machine themselves (presumably in the batch submission script).

3.1 Fooling Slurm

Most resource schedulers manage static hardware. In order to use the popular Slurm (2) resource manager, it is necessary to convince Slurm there are more machines (nodes) than actually exist in the cluster.

Another factor with composable computing is the current need to reboot servers when the resources on the PCIe bus change. This is tantamount to turning the server off and inserting or removing PCIe cards (in this case GPUs). Fortunately, modern servers can be remotely power cycled using IPMI, and Slurm has a built-in power management mechanism that turns nodes on and off as needed based on the work queue. Beyond these standard capabilities, Slurm also had to be fooled into thinking the alias nodes exist.

One approach is to keep all nodes turned off; when a requested node is needed, Slurm turns it on (and configures the PCIe fabric) using the power-saving resume script mechanism.
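
The relevant slurm.conf power-saving settings take the following general form. The values and script paths here are illustrative, not the actual PoC configuration; the resume script is the natural place to compose the PCIe fabric and power the node on via IPMI.

```
# slurm.conf power-saving sketch (paths and values are illustrative)
SuspendTime=300                        # idle seconds before a node is suspended
SuspendProgram=/etc/slurm/suspend.sh   # powers the node off (e.g., via IPMI)
ResumeProgram=/etc/slurm/resume.sh     # composes the PCIe fabric, powers node on
ResumeTimeout=600                      # seconds allowed for boot and slurmd start
```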

3.2 Alias Node Networking

Using the hardware environment described above, we compose four additional machines on two separate networks (kraken-a and leviathan-a were already defined):

  kraken-a        leviathan-a
  kraken-a-2gpu   leviathan-a-2gpu
  kraken-a-4gpu   leviathan-a-4gpu

In order to route network traffic to the appropriate alias nodes, routing was configured so that each alias network shares the eth0 interface (replace the placeholders with the site's alias network addresses):

route add -net <alias-network-1> dev eth0
route add -net <alias-network-2> dev eth0

On the corresponding alias nodes, an alias interface was configured using ifconfig. For example, on leviathan-a the interface for leviathan-a-2gpu is enabled as follows:

ifconfig eno1:0 up
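
Each alias name also needs its own address for name resolution. A hypothetical /etc/hosts fragment might look like the following; the addresses are illustrative placeholders (drawn from the RFC 5737 documentation range), not the PoC's actual addresses.

```
# /etc/hosts sketch -- addresses are illustrative placeholders
192.0.2.10   kraken-a
192.0.2.11   kraken-a-2gpu      # alias on kraken-a's eno1:0
192.0.2.12   kraken-a-4gpu
192.0.2.20   leviathan-a
192.0.2.21   leviathan-a-2gpu   # alias on leviathan-a's eno1:0
192.0.2.22   leviathan-a-4gpu
```

Only one alias interface is active on a given server at a time, matching whichever GPU count the PCIe fabric currently provides.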

On both kraken-a and leviathan-a, a simple systemd service was added to run at startup. The service does the following:

  1. Count the number of GPUs available (as provided by the PCIe fabric switch).
  2. Enable the alias interface that corresponds to the number of GPUs (e.g., if four GPUs were found on kraken-a, the alias interface for kraken-a-4gpu is created with ifconfig eno1:0 up).
  3. Start slurmd on the node using the alias node name:
    /bin/slurmd -N kraken-a-4gpu
At this point, Slurm will believe kraken-a-4gpu is available for use.
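
The steps above can be sketched as a single startup script. This is an illustrative reconstruction, not the authors' actual service script; the GPU-detection command, interface name, and address placeholder are assumptions.

```shell
#!/bin/bash
# alias-node-start.sh -- illustrative sketch of the systemd startup service.
# Assumes GPUs attached by the PCIe fabric are visible to lspci.

HOST=$(hostname -s)                      # e.g., kraken-a
NGPU=$(lspci | grep -ci nvidia)          # count attached NVIDIA devices

if [ "$NGPU" -gt 0 ]; then
    ALIAS="${HOST}-${NGPU}gpu"           # e.g., kraken-a-4gpu
    # Bring up the alias interface for this GPU count
    # (<alias-address> is a site-specific placeholder).
    ifconfig eno1:0 <alias-address> up
    exec /bin/slurmd -N "$ALIAS"         # register with Slurm under the alias
else
    exec /bin/slurmd -N "$HOST"          # no GPUs: register as the base node
fi
```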

3.3 Add New Nodes to slurm.conf

In order to make Slurm aware of the alias nodes, the node names and new queues are added to the slurm.conf file. The following abbreviated listing shows how Slurm was configured.

PartitionName=normal Default=YES Nodes=kraken-a,leviathan-a
PartitionName=2gpu Default=NO Nodes=kraken-a-2gpu,leviathan-a-2gpu
PartitionName=4gpu Default=NO Nodes=kraken-a-4gpu,leviathan-a-4gpu
NodeName=kraken-a Sockets=2 CoresPerSocket=10 ...
NodeName=kraken-a-2gpu Sockets=2 CoresPerSocket=10 ...
NodeName=kraken-a-4gpu Sockets=2 CoresPerSocket=10 ...
NodeName=leviathan-a Sockets=2 CoresPerSocket=10 ...
NodeName=leviathan-a-2gpu Sockets=2 CoresPerSocket=10 ...
NodeName=leviathan-a-4gpu Sockets=2 CoresPerSocket=10 ...

The output of sinfo indicates the three queues (normal has no GPUs) as follows (output compressed and abbreviated):

normal* up inf 2 idle~ kraken-a,leviathan-a
2gpu    up inf 2 idle~ kraken-a-2gpu,leviathan-a-2gpu
4gpu    up inf 2 idle~ kraken-a-4gpu,leviathan-a-4gpu

Notice the "~" after the idle state. This indicates that the node is powered down and can be resumed when needed.
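
From the user's perspective, selecting a composed machine is simply a matter of choosing a partition. For example (the job script name is illustrative):

```shell
# submit a batch job to one of the pre-defined two-GPU alias machines
sbatch -p 2gpu --nodes=1 job.sh

# or start an interactive session on a four-GPU alias machine
srun -p 4gpu --nodes=1 --pty bash
```

In either case, Slurm's resume mechanism composes the PCIe fabric and powers on the matching alias node before the job starts.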

This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.