A closer look at the default PCIe fabric is presented in Fig. 2.


Fig. 2. Default PCIe fabric

In Fig. 2, kraken-a is Host 1 and leviathan-a is Host 2. The GPUs in the pooling appliance are located on IO1 and IO3. Each color indicates a separate PCIe partition (a separate bus); a host sees only what is attached (switched) to its partition. The default switch configuration can be seen in Fig. 3. There are two partitions (PAR 0 and 1, leftmost column). Kraken-a has four PCIe lanes connected to FBRXPORT ports 1..4 (indicated on the switch display as "1..4"). Two of the GPUs are connected via ports 9..12 in partition 0; thus, when kraken-a is powered up, it will detect two GPUs. Leviathan-a also has two GPUs (ports 17..20) connected to partition 1 via host ports 5..8 and will see these when powered up. Note that ports 13..16 and 21..24 are not used in this setup.
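In summary, the default mapping described above is:

Partition 0: kraken-a (host ports 1..4) plus two GPUs (ports 9..12)
Partition 1: leviathan-a (host ports 5..8) plus two GPUs (ports 17..20)
Unused: ports 13..16 and 21..24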

As will be illustrated below, moving GPUs between hosts amounts to the FabreX switch moving ports from one partition to another; for example, moving ports 17..20 to partition 0 would attach leviathan-a's GPUs to kraken-a.


Fig. 3. GigaIO FabreX default switch configuration (via web GUI)

2 Basic Tests

In order to perform some basic numeric tests, the NVIDIA multi-GPU programming models GitHub site was consulted (https://github.com/NVIDIA/multi-gpu-programming-models). This site provides source code for a well-known multi-GPU Jacobi solver implemented with several different multi-GPU programming models. For simplicity, we chose the multi-threaded OpenMP version, which uses CUDA memcpy for inter-GPU communication. All examples were compiled with nvcc (CUDA compilation tools, release 11.4, V11.4.152).
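For readers unfamiliar with this programming model, the following minimal sketch (our own illustration, not code from the repository) shows the basic pattern: one OpenMP host thread is launched per GPU, each thread binds to its device with cudaSetDevice, and neighboring devices exchange boundary rows with device-to-device copies.

// Minimal sketch of the multi-threaded OpenMP multi-GPU pattern.
// Compile with: nvcc -Xcompiler -fopenmp sketch.cu
#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

int main() {
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);   // honors CUDA_VISIBLE_DEVICES

    #pragma omp parallel num_threads(num_gpus)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);          // each thread owns one GPU

        // Allocate this device's slab of the grid and iterate the
        // Jacobi kernel here. After each sweep, neighboring devices
        // exchange boundary (halo) rows with a device-to-device copy:
        //   cudaMemcpy(dst_halo, src_halo, bytes, cudaMemcpyDeviceToDevice);
        // The CUDA runtime routes this over PCIe, peer-to-peer when possible.

        #pragma omp barrier          // keep threads in lockstep between sweeps
    }
    printf("Ran with %d GPU(s)\n", num_gpus);
    return 0;
}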

The resulting binary was launched on two GPUs using the simple script:

export CUDA_VISIBLE_DEVICES=0,1
export OMP_PLACES={0},{1}
./jacobi -nx 32768 -ny 32768 -niter 1500

Note that CUDA_VISIBLE_DEVICES and OMP_PLACES limit the run to two threads and two GPUs. Before running, the number of GPUs was confirmed with the following command:

lspci|grep -i nvidia
a5:00.0 3D controller: NVIDIA Corporation GP100GL ...
a6:00.0 3D controller: NVIDIA Corporation GP100GL ...

The output demonstrated perfect speedup (as would be expected).

Num GPUs: 2.
32768x32768: 1 GPU: 50.8043 s, 2 GPUs: 25.2834 s, speedup: 2.01, efficiency: 100.47
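The same check can also be made from the CUDA side. The short sketch below (our own illustration, not part of the benchmark) counts the devices CUDA enumerates and prints their PCI bus IDs, which can be matched against the lspci output above.

// List the devices CUDA can see (honors CUDA_VISIBLE_DEVICES).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("Num GPUs visible: %d\n", n);
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        // pciBusID lets us match CUDA device numbers to lspci addresses.
        printf("device %d: %s (PCI bus 0x%02x)\n", d, p.name, p.pciBusID);
    }
    return 0;
}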

The next test was to move the two GPUs from kraken-a to leviathan-a and re-run the Jacobi solver example with four GPUs. All switch configuration was done on head4-a. This step requires moving ports 9..12 from partition 0 to partition 1 using the fmtool FabreX switch management tool. (The command line nature of this tool is important in the Slurm section.) First, both kraken-a and leviathan-a were powered down. Next, the switch (network name "virgo39-a") is told to unassign ("-U") ports 9..12 (only the first port number is needed):

fmtool -U switch:virgo39-a,port_id:9 virgo39-a

Next, ports 9..12 are bound ("-B") to partition 1 ("part_id:1") on the switch:

fmtool -B switch:virgo39-a,part_id:1,port_id:9 virgo39-a

As seen in Fig. 4, the new assignments can be viewed on the FabreX switch web GUI: leviathan-a now has four GPUs and kraken-a has none.


Fig. 4. FabreX switch web GUI with all GPUs connected to partition 1 (leviathan-a)

Both kraken-a and leviathan-a are then rebooted and a check on leviathan-a now reveals four GPUs.

lspci|grep -i nvidia
a5:00.0 3D controller: NVIDIA Corporation GP100GL ...
a6:00.0 3D controller: NVIDIA Corporation GP100GL ...
af:00.0 3D controller: NVIDIA Corporation GP100GL ...
b0:00.0 3D controller: NVIDIA Corporation GP100GL ...

Next, the Jacobi solver was run on four GPUs using the following script:

export CUDA_VISIBLE_DEVICES=0,1,2,3
export OMP_PLACES={0},{1},{2},{3}
./jacobi -nx 32768  -ny 32768 -niter 1500

Scaling (speedup) was less than in the two-GPU case, but still very good.

Num GPUs: 4.
32768x32768: 1 GPU:  47.7831 s, 4 GPUs:  12.7408 s, speedup:     3.75, efficiency:    93.76 
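These figures follow directly from the timings: the speedup is 47.7831 s / 12.7408 s ≈ 3.75, and the efficiency is 3.75 / 4 GPUs ≈ 93.8%. (The two-GPU run above yields the same ratios: 50.8043/25.2834 ≈ 2.01 and 2.01/2 ≈ 100%.)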

Another interesting variation was to use one GPU from each port range (one from ports 9..12 and one from ports 17..20). This was accomplished by setting CUDA_VISIBLE_DEVICES in the two-GPU run script:

export CUDA_VISIBLE_DEVICES=0,3
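Since the two GPUs now sit on different port ranges, one may wonder whether direct peer-to-peer copies are still possible between them. The sketch below (our own illustration) queries this with the CUDA peer-access API; note that with CUDA_VISIBLE_DEVICES=0,3 the selected pair is renumbered as devices 0 and 1.

// Query whether the two visible GPUs support direct peer-to-peer access.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    printf("peer access 0 -> 1: %s\n", can_access ? "yes" : "no");
    return 0;
}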

The results were identical (perfect scaling) to the case where the two GPUs were on the same port range. For completeness, a final step is to return the switch to its default initialization from the head4-a node. This was done as follows. First, as before, kraken-a and leviathan-a were powered down. Then the switch was uninitialized ("-u") using the following command:

fmtool -u virgo39-a

Next, the default topology needed to be loaded ("-vf") after moving to the correct topology directory:

cd /opt/gigaio-fabrexfm-tool/topologies/release/sj1/1S-2x4-X1
fmtool -vf virgo39a.host-list.json ltr.yml virgo39-a


