
Note: this paper was prepared for a conference that we decided not to attend (Okay, it was not accepted). It is written in a more formal style than the normal ClusterMonkey articles and is sponsored by The Beowulf Foundation.

Abstract

Popular homogeneous clustered HPC systems (e.g., commodity x86 servers connected by a high-speed interconnect) have given way to heterogeneous clusters composed of multi-core servers, high-speed interconnects, accelerators (often GPU based), and custom storage arrays. Cluster designers are often faced with finding a balance between purpose-built (tailored to specific problem domains) and general-use systems. Traditional cluster-based approaches, however, all share a hard boundary between internal server buses (mainly PCIe) and the rest of the cluster. In heterogeneous environments, the server boundary often creates inefficient resource management, limits solution flexibility, and heavily influences the design of clustered HPC applications. This paper explores the malleability of the GigaIO™ FabreX™ PCIe memory fabric in relation to HPC cluster applications. A discussion of emerging concepts (e.g., a routable PCIe bus) and hands-on benchmarks using shared GPUs will be provided. In addition, results of a simple integration with the Slurm resource scheduler will be discussed as a way to make composable/malleable computing transparently available to end-users.

Keywords: Composable computing, malleable computing, PCIe, HPC cluster, Slurm, benchmark, FabreX, GigaIO, resource scheduler

1 Background

In many HPC installations, popular homogeneous cluster designs have given way to heterogeneous systems often with varying amounts and types of hardware. This hardware is fixed within server boundaries and often limits the ability of end-users to maximize performance across multiple servers. Composable computing (or malleable computing) offers a way to create resources that better fit end-user applications.

The primary way to move past the server boundary has been to send data over a network. In the HPC sector, this is accomplished with high-speed Ethernet or InfiniBand networks. The preferred solution by many users, however, is the ability to "share" or "switch" the PCIe bus fabric between clustered HPC servers.

As an example, in many HPC clusters GPU resources are located on specific nodes and often the number of GPUs per node is fixed (e.g., two GPUs per server). This situation requires users who would like to apply more GPUs to their application (e.g., four) to run jobs across servers using the network-based Message Passing Interface (MPI). Oftentimes the disparity between the network and the PCIe bus creates bottlenecks and less efficient operation.

There are large "GPU count" servers available, but as the number of GPUs increases, so does the cost. Organizations can find it difficult to justify large GPU systems for a handful of users while a majority of other users can use one or two GPUs per node. The development of the Compute Express Link (1) (CXL) standard is underway and expected to be adopted by most major vendors in the future as a solution to this challenge.

There are composable computing solutions currently available, like those offered by GigaIO™ Networks. GigaIO offers a composable option for servers using the FabreX™ PCIe switch. FabreX allows the PCIe fabric on a server to be connected to (and disconnected from) additional PCIe resources, and to other servers with their own PCIe trees.

As this paper will indicate, it is possible to compose machines with a varying number of GPUs using a configurable fabric of PCIe channels. As will be shown, GPU resources can be added to (and removed from) HPC servers without physically moving resources (i.e., moving cards between servers). This capability makes it possible to concentrate resources when needed and easily distribute them otherwise. A proof-of-concept (PoC) example shows how the Slurm resource manager can be used to manage a composable machine for end-users.

1.1 Hardware Environment

In order to study the basic functionality of composable hardware, we used three servers, an Ethernet switch, a GigaIO™ FabreX™ PCIe Gen-3 switch, two FabreX PCIe Network Adapters (for the worker nodes), and an external GigaIO Accelerator Pooling Appliance (which holds the four GPUs). The hardware configuration is described in Fig. 1. Note that the hardware used for this investigation is based on the older GigaIO Gen-3 (PCIe 3.0) components, while the current Gen-4 hardware offers PCIe 5.0 support and expanded features.


Fig. 1. Hardware layout used for analysis. Head4-a is the director/control server that controls the FabreX switch and acts as the Slurm control node (user jobs are submitted from this node).

Kraken-a and Leviathan-a are two compute worker nodes that are connected to the FabreX switch using two FabreX PCIe Network Adapters. These nodes also serve as Slurm worker nodes. The GPUs are located in a GigaIO pooling appliance (i.e., they are not housed in the two worker nodes).

A closer look at the default PCIe fabric is presented in Fig. 2.


Fig. 2. Default PCIe fabric

In Fig. 2, kraken-a is Host 1 and leviathan-a is Host 2. The GPUs in the pooling appliance are located on IO1 and IO3. The color indicates a separate PCIe partition (separate bus); each host only sees what is attached (switched) to its partition. The default switch configuration can be seen in Fig. 3. There are two partitions (PAR; 0 and 1, left-most column). Kraken-a has four PCIe lanes connected to FBRXPORT ports 1,2,3,4 (indicated on the switch display as "1..4"). Two of the GPUs are connected via ports 9..12 in partition 0. Thus, when kraken-a is powered up, it will detect two GPUs. Leviathan-a also has two GPUs (ports 17..20) connected to partition 1 via Leviathan ports 5..9 and will see these when powered up. Note that ports 13..16 and 21..24 are not used in this setup.

As will be illustrated below, moving the GPUs from leviathan-a to kraken-a happens when the FabreX switch moves ports 17..20 to partition 0.


Fig. 3. GigaIO FabreX default switch configuration (via Web GUI)

2 Basic Tests

In order to perform some basic numeric tests, the NVIDIA multi-GPU Programming Models GitHub site was consulted (https://github.com/NVIDIA/multi-gpu-programming-models). This site provides source code for a well-known multi-GPU Jacobi solver implemented with different multi-GPU programming models. For simplicity, we chose the multi-threaded OpenMP version that uses CUDA memcpy for inter-GPU communication. All examples were compiled with the nvcc CUDA compilation tools, release 11.4, V11.4.152.
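For reference, the chosen variant can be fetched and built directly from the repository. The following is a minimal sketch; the subdirectory name (multi_threaded_copy) and the presence of a per-variant Makefile are assumptions based on the repository layout and may need adjusting for other releases or CUDA versions.

git clone https://github.com/NVIDIA/multi-gpu-programming-models.git
cd multi-gpu-programming-models/multi_threaded_copy   # OpenMP threads + cudaMemcpy variant (directory name assumed)
make                                                  # invokes nvcc and produces the ./jacobi binary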

The resulting binary was launched on two GPUs using the simple script:

export CUDA_VISIBLE_DEVICES=0,1
export OMP_PLACES={0},{1}
./jacobi -nx 32768 -ny 32768 -niter 1500

Note, CUDA_VISIBLE_DEVICES and OMP_PLACES limit the run to two threads and two GPUs. Before running, the number of GPUs was confirmed with the following command:

lspci|grep -i nvidia
a5:00.0 3D controller: NVIDIA Corporation GP100GL ...
a6:00.0 3D controller: NVIDIA Corporation GP100GL ...

The jacobi output demonstrated perfect speedup (as would be expected).

Num GPUs: 2.
32768x32768: 1 GPU: 50.8043 s, 2 GPUs: 25.2834 s, speedup: 2.01, efficiency: 100.47

The next test was to move the two GPUs from kraken-a to leviathan-a and re-run the Jacobi solver example with four GPUs. All switch configuration was done on head4-a. This step requires moving ports 9..12 from partition 0 to partition 1 using the fmtool FabreX switch management tool. (The command-line nature of this tool is important in the Slurm section.) First, both kraken-a and leviathan-a were powered down. Next, the switch (network name "virgo39-a") is told to unassign ("-U") ports 9..12 (only the first port number is needed).

fmtool -U switch:virgo39-a,port_id:9 virgo39-a

Next, ports 9..12 are bound ("-B") to partition 1 ("part_id=1") on the switch.

fmtool -B switch:virgo39-a,part_id:1,port_id:9 virgo39-a

As seen in Fig. 4, the new assignments can be viewed on the FabreX switch web GUI. Leviathan-a has four GPUs and kraken-a has none.


Fig. 4. FabreX switch web GUI with all GPUs connected to partition 1 (leviathan-a)

Both kraken-a and leviathan-a are then rebooted and a check on leviathan-a now reveals four GPUs.

lspci|grep -i nvidia
a5:00.0 3D controller: NVIDIA Corporation GP100GL ...
a6:00.0 3D controller: NVIDIA Corporation GP100GL ...
af:00.0 3D controller: NVIDIA Corporation GP100GL ...
b0:00.0 3D controller: NVIDIA Corporation GP100GL ...

Next, the Jacobi solver was run on four GPUs using the following script:

export CUDA_VISIBLE_DEVICES=0,1,2,3
export OMP_PLACES={0},{1},{2},{3}
./jacobi -nx 32768  -ny 32768 -niter 1500

Scaling (speedup) was less than in the two-GPU case, but still very good.

Num GPUs: 4.
32768x32768: 1 GPU:  47.7831 s, 4 GPUs:  12.7408 s, speedup:     3.75, efficiency:    93.76 

Another interesting variation was to use one GPU from each port range. This was accomplished by setting CUDA_VISIBLE_DEVICES in the two-GPU run script as follows.

export CUDA_VISIBLE_DEVICES=0,3

The results were identical (perfect scaling) to the case where the two GPUs were on the same port range. For completeness, a final step is to return the switch to its default initialization from the head4-a node. This was done as follows. First, as before, kraken-a and leviathan-a were powered down. Then the switch was uninitialized ("-u") using the following command.

fmtool -u virgo39-a

Next the default topology needed to be loaded ("-vf") after moving to the correct topology directory.

cd /opt/gigaio-fabrexfm-tool/topologies/release/sj1/1S-2x4-X1
fmtool -vf virgo39a.host-list.json ltr.yml virgo39-a

3 Slurm Integration

From an HPC perspective it is assumed end-users prefer to conceptualize computing in terms of machines and not be responsible for composing a machine. In particular, most HPC clusters use a job scheduler to manage multiple users and cluster resources. From this perspective, users submit jobs that request resources. These resources can be cluster nodes, processor cores, memory, storage, GPU nodes (accelerators), software licenses, etc.

Of particular interest is the use of multi-GPU machines for application acceleration. One possible approach, explored here as a Proof-of-Concept (PoC), is the use of "alias machines" that are composed at run-time by the scheduler. Thus, users are presented with "existing" machine options rather than having to configure a machine themselves (presumably in the batch submission script).

3.1 Fooling Slurm

Most resource schedulers manage static hardware. In order to use the popular Slurm (2) resource manager, it is necessary to convince Slurm there are more machines (nodes) than actually exist in the cluster.

Another factor with composable computing is the current need to reboot servers when the resources on the PCIe bus change. This is tantamount to turning the server off and inserting or removing PCIe cards (in this case GPUs). Fortunately, modern servers can be remotely power cycled using IPMI, and Slurm has a built-in power management mechanism to turn nodes on and off as needed based on the work queue. In addition to these standard capabilities, Slurm had to be fooled into thinking the alias nodes exist.

One approach is to keep all nodes turned off; when a requested node is needed, Slurm turns the node on (and configures the PCIe fabric) using the power saving resume script mechanism.

3.2 Alias Node Networking

Using the hardware environment described above, we compose four additional machines on two separate networks (kraken-a and leviathan-a were already defined):
192.168.88.210 kraken-a
192.168.88.212 leviathan-a
192.168.2.210 kraken-a-2gpu
192.168.2.212 leviathan-a-2gpu
192.168.4.210 kraken-a-4gpu
192.168.4.212 leviathan-a-4gpu

In order to route network traffic to the appropriate alias nodes, routing was configured to share the eth0 interface for each network.

route add -net 192.168.2.0/24 dev eth0
route add -net 192.168.4.0/24 dev eth0

On the corresponding alias nodes, an alias interface was configured using ifconfig (e.g., on leviathan-a, the interface for leviathan-a-2gpu is enabled as follows):

ifconfig eno1:0 192.168.2.212 up

On both kraken-a and leviathan-a, a simple systemd service was added to run at startup. The service does the following (a minimal sketch of such a script is shown after this list):

  1. Count the number of GPUs available (as provided by the PCIe fabric switch).
  2. Enable the alias interface that corresponds to the number of GPUs. For example, if four GPUs were found on kraken-a, the kraken-a-4gpu interface is created:
    ifconfig eno1:0 192.168.4.210 up
  3. Start slurmd on the node using the alias node name:
    /bin/slurmd -N kraken-a-4gpu

At this point, Slurm will believe kraken-a-4gpu is available for use.
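The following is a minimal sketch of such a startup script, assuming the script name, the eno1 interface, and the host-to-address mapping from the network list above; it simply packages the three steps just listed.

#!/bin/bash
# gpu-alias-start.sh (hypothetical name) -- run at boot from a simple systemd unit
# Step 1: count the GPUs currently presented by the PCIe fabric
GPUS=$(lspci | grep -i nvidia | wc -l)
HOST=$(hostname -s)                     # kraken-a or leviathan-a
case "$HOST" in
  kraken-a)    OCTET=210 ;;
  leviathan-a) OCTET=212 ;;
  *) echo "unknown host $HOST"; exit 1 ;;
esac
# Step 2: enable the alias interface that matches the GPU count
if [ "$GPUS" -eq 4 ]; then
  ALIAS="${HOST}-4gpu"; ifconfig eno1:0 192.168.4.${OCTET} up
elif [ "$GPUS" -eq 2 ]; then
  ALIAS="${HOST}-2gpu"; ifconfig eno1:0 192.168.2.${OCTET} up
else
  ALIAS="$HOST"                         # no GPUs: come up as the base node
fi
# Step 3: start slurmd under the alias node name
/bin/slurmd -N "$ALIAS"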

3.3 Add New Nodes to slurm.conf

In order to make Slurm aware of the alias nodes, the node names and new queues are added to the slurm.conf file. The following abbreviated listing shows how Slurm was configured.

PartitionName=normal Default=YES Nodes=kraken-a,leviathan-a 
PartitionName=2gpu Default=NO Nodes=kraken-a-2gpu,leviathan-a-2gpu
PartitionName=4gpu Default=NO Nodes=kraken-a-4gpu,leviathan-a-4gpu
NodeName=kraken-a Sockets=2 CoresPerSocket=10 ....
NodeName=kraken-a-2gpu Sockets=2 CoresPerSocket=10 ...
NodeName=kraken-a-4gpu Sockets=2 CoresPerSocket=10 ...
NodeName=leviathan-a Sockets=2 CoresPerSocket=10 ...
NodeName=leviathan-a-2gpu Sockets=2 CoresPerSocket=10 ... 
NodeName=leviathan-a-4gpu Sockets=2 CoresPerSocket=10 ...

The output of sinfo indicates the three queues (normal has no GPUs) as follows (output compressed and abbreviated):

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up inf 2 idle~ kraken-a,leviathan-a
2gpu    up inf 2 idle~ kraken-a-2gpu,leviathan-a-2gpu
4gpu    up inf 2 idle~ kraken-a-4gpu,leviathan-a-4gpu

Notice the "~" after the idle state. This indicates that the node is powered down and can be resumed when needed.

3.4 Power Control and PCIe Configuration

As mentioned, the Slurm Power Saving (3) feature can be used to turn nodes off and on depending on the job requests in the work queues. Because power control is site dependent, system administrators normally write these scripts for their given environment. The basic power control scripts can be user defined, but are usually written as shell scripts and nominally called slurm-resume.sh and slurm-suspend.sh. Both scripts often use IPMI commands to remotely start and stop cluster nodes. There is no reason why these scripts cannot also be used to configure the PCIe fabric before the node is powered on.
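For context, power saving is wired into slurm.conf roughly as follows. The script paths match those shown in the log output below; the timer values are illustrative assumptions only.

# Power saving hooks (paths taken from the log output below)
SuspendProgram=/root/Slurm/slurm-suspend.sh
ResumeProgram=/root/Slurm/slurm-resume.sh
# Timer values below are illustrative assumptions
SuspendTime=300
SuspendTimeout=120
ResumeTimeout=600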

The slurm-resume.sh and slurm-suspend.sh scripts are somewhat lengthy, but relatively simple. Slurm provides each script with a list of nodes that need to be powered up or down. The suspend script performs two simple tasks in the following order:

  1. Power down the node using IPMI.
  2. Return the PCIe fabric switch to the default setting.

The resume script performs two tasks in the following order (a condensed sketch is shown after this list):

  1. Configure the PCIe fabric switch based on the node name. For example, if node leviathan-a-4gpu is requested, it will use the fmtool commands described above to move all four GPUs to the leviathan-a node.
  2. Power up the node using IPMI.
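A condensed sketch of the resume logic is given below. It assumes a single node name argument and placeholder BMC host names and credentials; the fmtool invocations follow the forms and switch/port assignments described earlier in this paper.

#!/bin/bash
# slurm-resume.sh (condensed sketch) -- Slurm passes the node name as an argument
NODE="$1"
SWITCH=virgo39-a
# Step 1: configure the PCIe fabric switch based on the requested node name
case "$NODE" in
  kraken-a-4gpu|leviathan-a)         # move leviathan-a's GPUs (ports 17..20) to partition 0
    fmtool -U switch:${SWITCH},port_id:17 ${SWITCH}
    fmtool -B switch:${SWITCH},part_id:0,port_id:17 ${SWITCH} ;;
  leviathan-a-4gpu|kraken-a)         # move kraken-a's GPUs (ports 9..12) to partition 1
    fmtool -U switch:${SWITCH},port_id:9 ${SWITCH}
    fmtool -B switch:${SWITCH},part_id:1,port_id:9 ${SWITCH} ;;
  *-2gpu)                            # default topology already gives each node two GPUs
    : ;;
esac
# Step 2: power the node up via its BMC (addresses and credentials are placeholders)
case "$NODE" in
  kraken-a*)    BMC=kraken-bmc ;;
  leviathan-a*) BMC=leviathan-bmc ;;
esac
ipmitool -I lanplus -H ${BMC} -U ADMIN -P PASSWORD chassis power on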

The following log output shows the process of starting and stopping a node using slurm-resume.sh and slurm-suspend.sh.

Resume invoked /root/Slurm/slurm-resume.sh kraken-a-4gpu
Switch for kraken-a-4gpu
kraken-a-4gpu requested no GPUs or 4 GPUs 
fmtool -U switch:virgo39-a,port_id:17 virgo39-a result: SUCCESS: sent unbind request to virgo39-a
fmtool -B switch:virgo39-a,part_id:0,port_id:17 virgo39-a result: SUCCESS: sent binding information to virgo39-a
IPMI kracken power up result: 0

... [User Job Runs]

Suspend invoked /root/Slurm/slurm-suspend.sh kraken-a-4gpu
IPMI kracken power down result: 0
fmtool uninitialize virgo39-a result: SUCCESS: uninitialize fabric complete on virgo39-a
fmtool apply default topology to virgo39-a result: SUCCESS: sent virgo39a.host-list.json to virgo39-a
SUCCESS: sent ltr.yml to virgo39-a

In the case of kraken-a-2gpu and leviathan-a-2gpu there is no switch configuration needed because the default configuration gives both machines two GPUs. In addition, requests for kraken-a or leviathan-a move all GPUs to the other node.

3.5 Protecting the Node from Slurm

Since Slurm believes there are more nodes than really exist, it is important to make sure the other alias nodes are unavailable (set to the DRAIN state using scontrol) when the actual node is in use; otherwise, Slurm may try to use those nodes. This step can be accomplished using the slurmctld prolog and epilog scripts (script paths defined in slurm.conf). These scripts are run on the director node (where slurmctld runs) before and after each job. Note, the Slurm authors discourage the use of scontrol in prolog and epilog scripts; however, we find it acceptable for this PoC work. The prolog script is as follows:

#!/bin/bash
# slurm.conf: PrologSlurmctld=/etc/slurm/prolog.sh
NODES="kraken-a kraken-a-2gpu kraken-a-4gpu leviathan-a \
leviathan-a-2gpu leviathan-a-4gpu"
for N in $NODES;do
  # drain all other alias nodes so Slurm cannot schedule them
  if [ $N != $SLURM_JOB_NODELIST ]; then
    /bin/scontrol update NodeName=$N State=DRAIN Reason="alias node in use"
  fi
done

The epilog script is as follows:

#!/bin/bash
# slurm.conf: EpilogSlurmctld=/etc/slurm/epilog.sh
NODES="kraken-a kraken-a-2gpu kraken-a-4gpu leviathan-a leviathan-a-2gpu leviathan-a-4gpu"
/bin/echo "`date` In epilog " >>/tmp/sout.log
/bin/echo "`date` require node $SLURM_JOB_NODELIST " >> /tmp/sout.log
for N in $NODES;do
  if [ $N != $SLURM_JOB_NODELIST ]; then
    /bin/echo "`date` undraining $N" >>/tmp/sout.log
    /bin/scontrol update NodeName=$N State=IDLE
  fi
done

There is some potential to optimize this configuration. Once the current node is configured, the second node has a usable configuration and does not have to be taken out of use.

3.6 Summary of Job Flow

The diagram in Fig. 5 shows the complete sbatch job flow for both the director node and a worker node. The valid nodes and queues are shown in Table 1.

Queue    Machines                           GPUs
normal   kraken-a, leviathan-a              0
2gpu     kraken-a-2gpu, leviathan-a-2gpu    2
4gpu     kraken-a-4gpu, leviathan-a-4gpu    4

Table 1. Slurm alias nodes and queues

The possible combinations are (two nodes at any one time):

  1. kraken-a, leviathan-a-4gpu
  2. kraken-a-4gpu, leviathan-a
  3. kraken-a-2gpu, leviathan-a-2gpu


Fig. 5. Slurm Job Flow for Composable Nodes

3.7 Slurm Job Results

In order to test the queues, three simple scripts were created:
  1. slurm-test.sh - request 0 GPUs from the normal queue
  2. slurm-test-gpu2.sh - request 2 GPUs from the 2gpu queue
  3. slurm-test-gpu4.sh - request 4 GPUs from the 4gpu queue

Each script counts the number of GPUs available (using lspci) and waits 30 seconds before completing. The pertinent part of slurm-test-gpu4.sh is shown below.

#SBATCH --partition=4gpu
SLEEPTIME=30
ME=$(hostname)
GPUS=$(lspci|grep -i nvidia|wc -l)
echo "My name is $ME and I have $GPUS GPUs"
echo Sleeping for $SLEEPTIME
sleep $SLEEPTIME
echo done
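For reference, the test scripts are submitted and monitored with the standard Slurm commands; an illustrative session might look like the following.

sbatch slurm-test-gpu4.sh
squeue
sinfo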

The correct number of GPUs was reported for each script, as indicated in Table 1 above. While the test was running, an sinfo command was run to show the state of the queues (output compressed and abbreviated):

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal* up inf 2 drain~ kraken-a,leviathan-a
2gpu    up inf 2 drain~ kraken-a-2gpu,leviathan-a-2gpu
4gpu    up inf 1 alloc# kraken-a-4gpu
4gpu    up inf 1 drain~ leviathan-a-4gpu

Notice that all the other nodes are in the drain state (not available) and the "#" next to "alloc" indicates the node is allocated and is in the power-up state.

4 Conclusions

The tests and integration concepts presented here are based on a "first look" at the hardware. More investigation is needed with additional hardware. It is, however, possible to draw some initial conclusions.
  1. Composing systems for HPC seems to work. There does not seem to be a loss of performance with the addition of GPU based resources.
  2. Integration with existing resource schedulers (e.g., Slurm) seems possible, however, more work is needed to create a production ready environment. This "masquerade" approach lets users think about machines and not configuration when running jobs.
  3. In terms of using a scheduler to configure the PCIe fabric, more investigation into safe switch reconfiguration is needed. For example, making sure that a new PCIe configuration does not change any other node's PCIe configuration while that node is running. This PoC did not address this issue.
  4. While rebooting servers does work, server boot times can be annoyingly long. In addition, some sites prefer not to reboot servers unless absolutely necessary. This may limit some of the methods explored here. It is expected that when a rapid and standard PCIe bus rescan becomes available, it will remove the need to reboot systems and make scripts like the Slurm suspend and resume scripts much more efficient.

The author would like to thank GigaIO Networks for the use of their hardware and for their assistance. The Beowulf Foundation mission is to foster and support advanced technical computing (HPC, Data Analytics, Artificial Intelligence, etc.) through commodity and open source-driven innovation and open collaboration.

All software is available by contacting the author and is expected to be on the Beowulf Foundation GitHub by the time of publication.

References

  1. CXL - Compute Express Link
  2. Slurm Page
  3. Slurm Power Control