3.4 Power Control and PCIe Configuration
As mentioned, the Slurm Power Saving (3) feature can be used to turn nodes off and on depending on the job requests in the work queues. Because power control is site dependent, the resume and suspend scripts this feature invokes are normally written by system administrators for their given environment. They are usually shell scripts, nominally called slurm-resume.sh and slurm-suspend.sh, and both often use IPMI commands to remotely start and stop cluster nodes. There is no reason why these scripts cannot also be used to configure the PCIe fabric before a node is powered on.
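For reference, the following is a minimal sketch of the slurm.conf entries that wire these scripts into the power-saving feature; the script paths match the log output shown later in this section, while the timeout values are purely illustrative.

```
# slurm.conf (excerpt) - power-saving hooks; timing values are illustrative
SuspendProgram=/root/Slurm/slurm-suspend.sh
ResumeProgram=/root/Slurm/slurm-resume.sh
SuspendTime=600        # seconds of idle time before a node is suspended
SuspendTimeout=120     # seconds allowed for the node to power down
ResumeTimeout=900      # seconds allowed for fabric setup plus node boot
```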
The slurm-resume.sh and slurm-suspend.sh scripts are somewhat lengthy, but relatively simple. Slurm provides each script with the list of nodes that need to be powered up or down. The suspend script performs two simple tasks in the following order (sketches of both scripts are given below):
- Power down the node using IPMI
- Return the PCIe fabric switch to the default setting

The resume script likewise performs two tasks in the following order:
- Configure the PCIe fabric switch based on the node name. For example, if node leviathan-a-4gpu is requested, the script uses the fmtool command described above to move all four GPUs to the leviathan-a node.
- Power up the node using IPMI
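The following is a minimal sketch of the suspend script, assuming ipmitool as the IPMI client; the BMC hostnames and credentials are placeholders, and the exact fmtool invocation that restores the default topology is site specific (the log output below only summarizes that step).

```
#!/bin/bash
# Sketch of slurm-suspend.sh: power the node down, then return the PCIe
# fabric switch to the default (two GPUs per machine) topology.
NODE="$1"                                  # node name handed in by Slurm
HOST=${NODE%-2gpu}; HOST=${HOST%-4gpu}     # physical host behind the alias
# power the physical machine off via its BMC (address/credentials are placeholders)
ipmitool -I lanplus -H ${HOST}-bmc -U ADMIN -P PASSWORD chassis power off
# Re-apply the default fabric topology. The exact fmtool options are site
# specific; the log output below shows the default host-list and topology
# files being re-applied to switch virgo39-a.
# fmtool ... virgo39-a
```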
The following log output illustrates the process of starting and stopping a node using the slurm-resume.sh and slurm-suspend.sh scripts:
```
Resume invoked /root/Slurm/slurm-resume.sh kraken-a-4gpu
Switch for kraken-a-4gpu
kraken-a-4gpu requested no GPUs or 4 GPUs
fmtool -U switch:virgo39-a,port_id:17 virgo39-a
result: SUCCESS: sent unbind request to virgo39-a
fmtool -B switch:virgo39-a,part_id:0,port_id:17 virgo39-a
result: SUCCESS: sent binding information to virgo39-a
IPMI kracken power up result: 0
...
[User Job Runs]
Suspend invoked /root/Slurm/slurm-suspend.sh kraken-a-4gpu
IPMI kracken power down result: 0
fmtool uninitialize virgo39-a
result: SUCCESS: uninitialize fabric complete on virgo39-a
fmtool apply default topology to virgo39-a
result: SUCCESS: sent virgo39a.host-list.json to virgo39-a
SUCCESS: sent ltr.yml to virgo39-a
```
In the case of kraken-a-2gpu and leviathan-a-2gpu, no switch configuration is needed because the default configuration already gives both machines two GPUs. In addition, requests for kraken-a or leviathan-a move all GPUs to the other node.
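The following is a minimal sketch of how the resume script can implement this name-based dispatch. The fmtool -U/-B arguments simply mirror the kraken-a-4gpu case from the log above (switch virgo39-a, partition 0, port 17); the corresponding port for the leviathan side, the BMC hostnames, and the credentials are placeholders.

```
#!/bin/bash
# Sketch of slurm-resume.sh: configure the fabric for the requested alias,
# then power on the physical machine behind it.
NODE="$1"                                  # node name handed in by Slurm
HOST=${NODE%-2gpu}; HOST=${HOST%-4gpu}     # physical host behind the alias
case "$NODE" in
  kraken-a-4gpu|leviathan-a)
    # both aliases place all four GPUs on kraken-a (values from the log above)
    fmtool -U switch:virgo39-a,port_id:17 virgo39-a
    fmtool -B switch:virgo39-a,part_id:0,port_id:17 virgo39-a
    ;;
  leviathan-a-4gpu|kraken-a)
    : # mirror image: bind all four GPUs to leviathan-a
      # (the corresponding switch/port IDs are site specific)
    ;;
  *-2gpu)
    : # default topology already gives each machine two GPUs - nothing to do
    ;;
esac
# power the physical machine on via its BMC (address/credentials are placeholders)
ipmitool -I lanplus -H ${HOST}-bmc -U ADMIN -P PASSWORD chassis power on
```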
3.5 Protecting the Node from Slurm
Since Slurm believes there are more nodes than really exist, it is important to make the other alias nodes unavailable (set to the DRAIN state using scontrol) while the actual node is in use; otherwise, Slurm may try to schedule jobs on them. This step can be accomplished using the slurmctld prolog and epilog scripts (the script paths are defined in slurm.conf). These scripts run on the director node (where slurmctld runs) before and after each job. Note that the Slurm authors discourage the use of scontrol in prolog and epilog scripts; however, we find it acceptable for this PoC work. The prolog script is as follows:
```
#!/bin/bash
# slurm.conf: PrologSlurmctld=/etc/slurm/prolog.sh
NODES="kraken-a kraken-a-2gpu kraken-a-4gpu leviathan-a \
leviathan-a-2gpu leviathan-a-4gpu"
for N in $NODES; do
  # drain the other alias nodes so Slurm cannot schedule onto them
  if [ "$N" != "$SLURM_JOB_NODELIST" ]; then
    /bin/scontrol update NodeName=$N State=DRAIN Reason="alias node in use"
  fi
done
```
The epilog script is as follows:
```
#!/bin/bash
# slurm.conf: EpilogSlurmctld=/etc/slurm/epilog.sh
NODES="kraken-a kraken-a-2gpu kraken-a-4gpu leviathan-a leviathan-a-2gpu leviathan-a-4gpu"
/bin/echo "`date` In epilog " >> /tmp/sout.log
/bin/echo "`date` require node $SLURM_JOB_NODELIST " >> /tmp/sout.log
for N in $NODES; do
  if [ "$N" != "$SLURM_JOB_NODELIST" ]; then
    /bin/echo "`date` undraining $N" >> /tmp/sout.log
    /bin/scontrol update NodeName=$N State=IDLE
  fi
done
```
There is some potential to optimize this configuration: once the current node is configured, the compatible alias on the other physical machine also has a usable configuration and does not have to be taken out of use. A sketch of this refinement is shown below.
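The following is a hypothetical sketch of such an optimized prolog, assuming the pairing of aliases listed in Sect. 3.6; the epilog would need a matching change, and the Reason text is illustrative.

```
#!/bin/bash
# Hypothetical optimized prolog: drain every alias except the requested node
# and its compatible peer on the other machine (the valid pairs from Sect. 3.6).
declare -A PEER=(
  [kraken-a]=leviathan-a-4gpu      [leviathan-a-4gpu]=kraken-a
  [kraken-a-4gpu]=leviathan-a      [leviathan-a]=kraken-a-4gpu
  [kraken-a-2gpu]=leviathan-a-2gpu [leviathan-a-2gpu]=kraken-a-2gpu
)
NODES="kraken-a kraken-a-2gpu kraken-a-4gpu leviathan-a leviathan-a-2gpu leviathan-a-4gpu"
KEEP="$SLURM_JOB_NODELIST ${PEER[$SLURM_JOB_NODELIST]}"
for N in $NODES; do
  case " $KEEP " in
    *" $N "*) : ;;   # requested node or its compatible peer: leave it in service
    *) /bin/scontrol update NodeName=$N State=DRAIN Reason="alias in use" ;;
  esac
done
```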
3.6 Summary of Job Flow

The diagram in Fig. 2 shows the complete sbatch job flow for both the director node and a worker node. The valid nodes and queues are shown in Table 1.
Queue | Machines | GPUs
---|---|---
normal | kraken-a, leviathan-a | 0
2gpu | kraken-a-2gpu, leviathan-a-2gpu | 2
4gpu | kraken-a-4gpu, leviathan-a-4gpu | 4

Table 1. Slurm queues and their alias nodes
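To make this mapping concrete, the node and partition declarations in slurm.conf might look roughly like the sketch below; only the queue names, node names, and GPU counts come from Table 1, and the remaining attributes (CPUs, memory, gres.conf entries, power-saving state) would still need to be filled in for a real cluster.

```
# slurm.conf (excerpt) - illustrative node and partition declarations
NodeName=kraken-a-2gpu,leviathan-a-2gpu Gres=gpu:2
NodeName=kraken-a-4gpu,leviathan-a-4gpu Gres=gpu:4
# CPU-only aliases, no GPUs
NodeName=kraken-a,leviathan-a
PartitionName=normal Nodes=kraken-a,leviathan-a
PartitionName=2gpu   Nodes=kraken-a-2gpu,leviathan-a-2gpu
PartitionName=4gpu   Nodes=kraken-a-4gpu,leviathan-a-4gpu
```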
The possible combinations are (two nodes at any one time):
- kraken-a, leviathan-a-4gpu
- kraken-a-4gpu, leviathan-a
- kraken-a-2gpu, leviathan-a-2gpu
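Given the queues in Table 1, a user simply submits to the queue matching the desired GPU count, and Slurm resumes, configures, and powers on the appropriate alias node. For example (the job script name is hypothetical):

```
# request one node with all four composed GPUs via the 4gpu queue
sbatch --partition=4gpu --nodes=1 --gres=gpu:4 train_model.sh
```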