3 Slurm Integration
From an HPC perspective, it is assumed that end-users prefer to think of computing in terms of machines and do not want to be responsible for composing a machine themselves. In particular, most HPC clusters use a job scheduler to manage multiple users and cluster resources. From this perspective, users submit jobs that request resources; these resources can be cluster nodes, processor cores, memory, storage, GPUs (accelerators), software licenses, etc. Of particular interest is the use of multi-GPU machines for application acceleration. One possible approach, explored here as a Proof-of-Concept (PoC), is the use of "alias machines" that are composed at run-time by the scheduler. Thus, users are presented with a choice of "existing" machines rather than having to configure a machine themselves (presumably in the batch submission script).
3.1 Fooling Slurm
Most resource schedulers manage static hardware. In order to use the popular Slurm (2) resource manager, it is necessary to convince Slurm that there are more machines (nodes) than actually exist in the cluster. Another factor with composable computing is the current need to reboot servers when the resources on the PCIe bus change. This is tantamount to turning the server off and inserting or removing PCIe cards (in this case GPUs). Fortunately, modern servers can be remotely power cycled using IPMI. In addition, Slurm has a built-in power saving mechanism that turns nodes on and off as needed based on the work queue. Beyond these standard capabilities, Slurm had to be fooled into thinking the alias nodes exist.
One approach is to keep all alias nodes turned off; when a requested node is needed, Slurm turns the node on (and configures the PCIe fabric) using the power saving resume script mechanism.
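As an illustration, such a resume hook could look roughly like the following sketch. The fabric-composition command, BMC host names, and IPMI credentials are placeholders; they will differ for any given PCIe fabric switch and server BMC setup.

#!/bin/bash
# Hypothetical Slurm ResumeProgram: compose the PCIe fabric for each
# requested alias node, then power the underlying server on via IPMI.
# Slurm passes the node list (e.g., "kraken-a-4gpu") as the first argument.
for node in $(scontrol show hostnames "$1"); do
    case "$node" in
        *-2gpu) gpus=2 ;;
        *-4gpu) gpus=4 ;;
        *)      gpus=0 ;;
    esac
    # Placeholder: attach $gpus GPUs to the server behind $node using the
    # fabric switch management CLI (vendor-specific, not shown here).
    # fabric-attach --host "$node" --gpus "$gpus"

    # Power the server on through its BMC (address and credentials are examples).
    ipmitool -I lanplus -H "${node}-bmc" -U admin -P secret chassis power on
done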
3.2 Alias Node Networking
Using the hardware environment described above, we compose four additional machines on two separate networks (kraken-a and leviathan-a were already defined).

192.168.88.210  kraken-a
192.168.88.212  leviathan-a
192.168.2.210   kraken-a-2gpu
192.168.2.212   leviathan-a-2gpu
192.168.4.210   kraken-a-4gpu
192.168.4.212   leviathan-a-4gpu
In order to route network traffic to the appropriate alias nodes, routes for the two alias networks were added over the shared eth0 interface.
route add -net 192.168.2.0/24 dev eth0
route add -net 192.168.4.0/24 dev eth0
On the corresponding alias nodes, an alias interface was configured using ifconfig (e.g., on leviathan-a the interface for leviathan-a-2gpu is enabled as follows):
ifconfig eno1:0 192.168.2.212 up
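Note that ifconfig alias interfaces are a legacy mechanism; on distributions that use iproute2, an equivalent command (an assumption, not part of the original setup) would add the alias address with a label:

ip addr add 192.168.2.212/24 dev eno1 label eno1:0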
On both kraken-a and leviathan-a, a simple systemd service was added to run at startup (a sketch of such a service is shown after the list). The service does the following:
- Count the number of GPUs available (as provided by the PCIe fabric switch)
- Enable the interface that corresponds to the number of GPUs (e.g., if four GPUs were found on kraken-a, the alias interface for kraken-a-4gpu is created as follows):
ifconfig eno1:0 192.168.4.210 up
- Slurmd is started on the node using the alias node name
/bin/slurmd -N kraken-a-4gpu
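A minimal sketch of such a startup service for kraken-a is shown below. The script path, unit file name, and the lspci-based GPU count are illustrative assumptions and would need to be adapted to the actual GPU model, fabric, and install locations.

#!/bin/bash
# Hypothetical startup script (e.g., /usr/local/sbin/alias-node-start.sh) on kraken-a.
# Count the GPUs currently attached by the PCIe fabric (match string depends on the GPU model).
NGPU=$(lspci | grep -ci "3D controller: NVIDIA")

# Enable the alias interface and start slurmd under the matching alias node name.
case "$NGPU" in
    2)
        ifconfig eno1:0 192.168.2.210 up
        /bin/slurmd -N kraken-a-2gpu
        ;;
    4)
        ifconfig eno1:0 192.168.4.210 up
        /bin/slurmd -N kraken-a-4gpu
        ;;
    *)
        # No GPUs attached: start slurmd as the plain node.
        /bin/slurmd -N kraken-a
        ;;
esac

# Hypothetical unit file (e.g., /etc/systemd/system/alias-node.service)
[Unit]
Description=Configure alias node interface and start slurmd
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/sbin/alias-node-start.sh

[Install]
WantedBy=multi-user.target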
3.3 Add New Nodes to slurm.conf
In order to make Slurm aware of the alias nodes, the node names and new queues are added to the slurm.conf file. The following abbreviated listing shows how Slurm was configured.
PartitionName=normal Default=YES Nodes=kraken-a,leviathan-a
PartitionName=2gpu Default=NO Nodes=kraken-a-2gpu,leviathan-a-2gpu
PartitionName=4gpu Default=NO Nodes=kraken-a-4gpu,leviathan-a-4gpu
NodeName=kraken-a Sockets=2 CoresPerSocket=10 ...
NodeName=kraken-a-2gpu Sockets=2 CoresPerSocket=10 ...
NodeName=kraken-a-4gpu Sockets=2 CoresPerSocket=10 ...
NodeName=leviathan-a Sockets=2 CoresPerSocket=10 ...
NodeName=leviathan-a-2gpu Sockets=2 CoresPerSocket=10 ...
NodeName=leviathan-a-4gpu Sockets=2 CoresPerSocket=10 ...
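To tie the alias nodes to the power saving resume mechanism described in Section 3.1, slurm.conf also needs the power saving parameters. A sketch is shown below; the script paths and timeout values are illustrative assumptions.

# Power saving: scripts that compose the fabric and power nodes on/off (paths are examples)
ResumeProgram=/usr/local/sbin/slurm-resume.sh
SuspendProgram=/usr/local/sbin/slurm-suspend.sh
# Seconds of idle time before a node is powered down
SuspendTime=600
# Allow time for fabric composition and server boot
ResumeTimeout=900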
The output of sinfo shows the three queues (normal has no GPUs) as follows (output compressed and abbreviated):
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
normal*    up     inf        2      idle~  kraken-a,leviathan-a
2gpu       up     inf        2      idle~  kraken-a-2gpu,leviathan-a-2gpu
4gpu       up     inf        2      idle~  kraken-a-4gpu,leviathan-a-4gpu
Notice the "~" after the idle state. This indicates that the node is powered down and can be resumed when needed.
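As a usage illustration (the job script name is a placeholder), a user simply selects the desired machine type by partition, and Slurm composes and powers up a matching alias node before the job starts:

# Request a "4-GPU machine" by submitting to the 4gpu partition
sbatch -p 4gpu my-gpu-job.sh

# Or run interactively on a "2-GPU machine"
srun -p 2gpu --pty bash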