
Note: this paper was prepared for a conference that we decided not to attend (Okay, it was not accepted). It is written in a more formal style than the normal ClusterMonkey articles and is sponsored by The Beowulf Foundation.

Abstract

Popular homogeneous clustered HPC systems (e.g., commodity x86 servers connected by a high-speed interconnect) have given way to heterogeneous clusters composed of multi-core servers, high-speed interconnects, accelerators (often GPU based), and custom storage arrays. Cluster designers are often faced with finding a balance between purpose-built systems (tailored to specific problem domains) and general-use systems. Traditional cluster-based approaches, however, all share a hard boundary between internal server buses (mainly PCIe) and the rest of the cluster. In heterogeneous environments, the server boundary often creates inefficient resource management, limits solution flexibility, and heavily influences the design of clustered HPC applications. This paper explores the malleability of the GigaIO™ FabreX™ PCIe memory fabric in relation to HPC cluster applications. A discussion of emerging concepts (e.g., a routable PCIe bus) and hands-on benchmarks using shared GPUs will be provided. In addition, results of a simple integration with the Slurm resource scheduler will be discussed as a way to make composable/malleable computing transparently available to end-users.

Keywords: composable computing, malleable computing, PCIe, HPC cluster, Slurm, benchmark, FabreX, GigaIO, resource scheduler

1 Background

In many HPC installations, popular homogeneous cluster designs have given way to heterogeneous systems often with varying amounts and types of hardware. This hardware is fixed within server boundaries and often limits the ability of end-users to maximize performance across multiple servers. Composable computing (or malleable computing) offers a way to create resources that better fit end-user applications.

The primary way to move past the server boundary has been to send data over a network. In the HPC sector, this is accomplished with high-speed Ethernet or InfiniBand networks. The preferred solution by many users, however, is the ability to "share" or "switch" the PCIe bus fabric between clustered HPC servers.

As an example, in many HPC clusters GPU resources are located on specific nodes and the number of GPUs per node is often fixed (e.g., two GPUs per server). This situation requires users who would like to apply more GPUs to their application (e.g., four) to run jobs across servers using the network-based Message Passing Interface (MPI). Oftentimes, the performance disparity between the network and the PCIe bus creates bottlenecks and less efficient operation.
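As a rough illustration of the conventional pattern (not taken from the paper's experiments), the following Python/mpi4py sketch shows how a four-GPU job is typically spread over two servers with two GPUs each: every MPI rank binds to one local GPU, intra-node traffic stays on PCIe, and inter-node traffic must cross the cluster network.

```python
# Minimal sketch, assuming mpi4py is installed and each node holds two GPUs.
# Run with e.g. "mpirun -np 4 --map-by ppr:2:node python this_script.py".
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Ranks on the same server share a node-local communicator; its rank is
# used to select one of the (two) GPUs physically installed in that server.
local = comm.Split_type(MPI.COMM_TYPE_SHARED)
gpu_id = local.Get_rank() % 2

print(f"global rank {rank}: bound to local GPU {gpu_id}")
```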

There are large "GPU count" servers available, but as the number of GPUs increases, so does the cost. Organizations can find it difficult to justify large GPU systems for a handful of users while a majority of other users can use one or two GPUs per node. The development of the Compute Express Link (1) (CXL) standard is underway and expected to be adopted by most major vendors in the future as a solution to this challenge.

Composable computing solutions are currently available, such as those offered by GigaIO™ Networks. GigaIO offers a composable option for servers using the FabreX™ PCIe switch. FabreX allows the PCIe fabric of a server to be connected to (and disconnected from) additional PCIe resources and to other servers with their own PCIe trees.

As this paper will indicate, it is possible to compose machines with a varying number of GPUs using a configurable fabric of PCIe channels. As will be shown, GPU resources can be added to (and removed from) HPC servers without physically moving resources (i.e., moving cards between servers). This capability provides the ability to concentrate resources when needed and easily redistribute them otherwise. A proof-of-concept (PoC) example shows how the Slurm resource manager can be used to manage a composable machine for end-users.
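To make the compose/decompose idea concrete, the Python sketch below outlines how a scheduler hook (e.g., a Slurm prolog/epilog) could attach pooled GPUs to a worker node before a job and return them afterward. The `fabrex_attach` and `fabrex_detach` commands are hypothetical placeholders; the real FabreX management interface (CLI or Redfish-style API) will differ, and this is not the PoC integration described later in the paper.

```python
# Conceptual sketch only: compose GPUs around a job, then release them.
# "fabrex_attach"/"fabrex_detach" are hypothetical placeholder commands.
import subprocess

def attach_gpus(node: str, count: int) -> None:
    """Bind `count` GPUs from the pooling appliance to the node's PCIe tree."""
    subprocess.run(["fabrex_attach", "--node", node, "--gpus", str(count)],
                   check=True)

def detach_gpus(node: str) -> None:
    """Return all of the node's composed GPUs to the pool."""
    subprocess.run(["fabrex_detach", "--node", node, "--all"], check=True)

if __name__ == "__main__":
    # e.g., invoked by a prolog/epilog with the target worker node name
    attach_gpus("kraken-a", 4)   # before the job: node sees four GPUs
    # ... user job runs ...
    detach_gpus("kraken-a")      # after the job: GPUs go back to the pool
```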

1.1 Hardware Environment

In order to study the basic functionality of composable hardware, we used three servers, an Ethernet switch, a GigaIO™ FabreX™ PCIe Gen-3 switch, two FabreX PCIe network adapters (for the worker nodes), and an external GigaIO Accelerator Pooling Appliance (which holds the four GPUs). The hardware configuration is described in Fig. 1. Note that the hardware used for this investigation is based on the older GigaIO Gen-3 (PCIe 3.0) components, while the current Gen-4 hardware offers PCIe 5.0 support and expanded features.


Fig. 1. Hardware layout used for the analysis.

Head4-a is the director/control server that controls the FabreX switch and acts as the Slurm control node (user jobs are submitted from this node).

Kraken-a and Leviathan-a are two compute worker nodes that are connected to the FabreX switch using two FabreX PCIe Network Adapters. These nodes also serve as Slurm worker nodes. The GPUs are located in a GigaIO pooling appliance (i.e., they are not housed in the two worker nodes).
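From a worker node's perspective, GPUs composed from the pooling appliance should enumerate on the local PCIe tree just like locally installed cards. A quick sanity check of this (a sketch, assuming NVIDIA GPUs and that `lspci` and `nvidia-smi` are installed on the worker node) might look like:

```python
# Verify that composed GPUs are visible on a worker node's PCIe bus
# and to the CUDA driver, using standard system tools.
import subprocess

# List NVIDIA devices on the PCIe bus (vendor ID 10de).
pci = subprocess.run(["lspci", "-d", "10de:"], capture_output=True, text=True)
print(pci.stdout)

# Confirm the CUDA driver also enumerates the same devices.
smi = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
print(smi.stdout)
```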





