3.4 Power Control and PCIe Configuration
As mentioned, the Slurm Power Saving (3) feature can be used to turn nodes off and on depending on the job requests in the work queues. Because power control is site dependent, the resume and suspend scripts this feature invokes are normally written by system administrators for their given environment. They are usually shell scripts, nominally called slurm-resume.sh and slurm-suspend.sh, and both often use IPMI commands to remotely start and stop cluster nodes. There is no reason why these scripts cannot also be used to configure the PCIe fabric before a node is powered on.
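For reference, the following is a minimal sketch of the slurm.conf entries that wire these scripts into the power-saving feature; the script paths match the log output shown later in this section, while the timeout values are purely illustrative.

```
# slurm.conf (excerpt) - power-saving hooks; timing values are illustrative
SuspendProgram=/root/Slurm/slurm-suspend.sh
ResumeProgram=/root/Slurm/slurm-resume.sh
SuspendTime=600        # seconds of idle time before a node is suspended
SuspendTimeout=120     # seconds allowed for the node to power down
ResumeTimeout=900      # seconds allowed for fabric setup plus node boot
```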
The slurm-resume.sh and slurm-suspend.sh scripts are somewhat lengthy, but relatively simple. Slurm provides each script with the list of nodes that need to be powered up or down. The suspend script performs two simple tasks in the following order (sketches of both scripts are given below):
- Power down the node using IPMI
- Return the PCIe fabric switch to the default setting

The resume script likewise performs two tasks in the following order:
- Configure the PCIe fabric switch based on the node name. For example, if node leviathan-a-4gpu is requested, the script uses the fmtool command described above to move all four GPUs to the leviathan-a node.
- Power up the node using IPMI
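The following is a minimal sketch of the suspend script, assuming ipmitool as the IPMI client; the BMC hostnames and credentials are placeholders, and the exact fmtool invocation that restores the default topology is site specific (the log output below only summarizes that step).

```
#!/bin/bash
# Sketch of slurm-suspend.sh: power the node down, then return the PCIe
# fabric switch to the default (two GPUs per machine) topology.
NODE="$1"                                  # node name handed in by Slurm
HOST=${NODE%-2gpu}; HOST=${HOST%-4gpu}     # physical host behind the alias
# power the physical machine off via its BMC (address/credentials are placeholders)
ipmitool -I lanplus -H ${HOST}-bmc -U ADMIN -P PASSWORD chassis power off
# Re-apply the default fabric topology. The exact fmtool options are site
# specific; the log output below shows the default host-list and topology
# files being re-applied to switch virgo39-a.
# fmtool ... virgo39-a
```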
The following log output illustrates the process of starting and stopping a node using the slurm-resume.sh and slurm-suspend.sh scripts:
```
Resume invoked /root/Slurm/slurm-resume.sh kraken-a-4gpu
Switch for kraken-a-4gpu
kraken-a-4gpu requested no GPUs or 4 GPUs
fmtool -U switch:virgo39-a,port_id:17 virgo39-a
result: SUCCESS: sent unbind request to virgo39-a
fmtool -B switch:virgo39-a,part_id:0,port_id:17 virgo39-a
result: SUCCESS: sent binding information to virgo39-a
IPMI kracken power up result: 0
...
[User Job Runs]
Suspend invoked /root/Slurm/slurm-suspend.sh kraken-a-4gpu
IPMI kracken power down result: 0
fmtool uninitialize virgo39-a
result: SUCCESS: uninitialize fabric complete on virgo39-a
fmtool apply default topology to virgo39-a
result: SUCCESS: sent virgo39a.host-list.json to virgo39-a
SUCCESS: sent ltr.yml to virgo39-a
```
In the case of kraken-a-2gpu and leviathan-a-2gpu, no switch configuration is needed because the default configuration already gives both machines two GPUs. In addition, requests for kraken-a or leviathan-a move all GPUs to the other node.
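The following is a minimal sketch of how the resume script can implement this name-based dispatch. The fmtool -U/-B arguments simply mirror the kraken-a-4gpu case from the log above (switch virgo39-a, partition 0, port 17); the corresponding port for the leviathan side, the BMC hostnames, and the credentials are placeholders.

```
#!/bin/bash
# Sketch of slurm-resume.sh: configure the fabric for the requested alias,
# then power on the physical machine behind it.
NODE="$1"                                  # node name handed in by Slurm
HOST=${NODE%-2gpu}; HOST=${HOST%-4gpu}     # physical host behind the alias
case "$NODE" in
  kraken-a-4gpu|leviathan-a)
    # both aliases place all four GPUs on kraken-a (values from the log above)
    fmtool -U switch:virgo39-a,port_id:17 virgo39-a
    fmtool -B switch:virgo39-a,part_id:0,port_id:17 virgo39-a
    ;;
  leviathan-a-4gpu|kraken-a)
    : # mirror image: bind all four GPUs to leviathan-a
      # (the corresponding switch/port IDs are site specific)
    ;;
  *-2gpu)
    : # default topology already gives each machine two GPUs - nothing to do
    ;;
esac
# power the physical machine on via its BMC (address/credentials are placeholders)
ipmitool -I lanplus -H ${HOST}-bmc -U ADMIN -P PASSWORD chassis power on
```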
3.5 Protecting the Node from Slurm
Since Slurm believes there are more nodes than really exist, it is important to make the other alias nodes unavailable (set to the DRAIN state using scontrol) while the actual node is in use; otherwise, Slurm may try to schedule jobs on them. This step can be accomplished using the slurmctld prolog and epilog scripts (the script paths are defined in slurm.conf). These scripts run on the director node (where slurmctld runs) before and after each job. Note that the Slurm authors discourage the use of scontrol in prolog and epilog scripts; however, we find it acceptable for this PoC work. The prolog script is as follows:
```
#!/bin/bash
# slurm.conf: PrologSlurmctld=/etc/slurm/prolog.sh
NODES="kraken-a kraken-a-2gpu kraken-a-4gpu leviathan-a \
leviathan-a-2gpu leviathan-a-4gpu"
for N in $NODES; do
  # drain the other alias nodes so Slurm cannot schedule onto them
  if [ "$N" != "$SLURM_JOB_NODELIST" ]; then
    /bin/scontrol update NodeName=$N State=DRAIN Reason="alias node in use"
  fi
done
```
The epilog script is as follows:
```
#!/bin/bash
# slurm.conf: EpilogSlurmctld=/etc/slurm/epilog.sh
NODES="kraken-a kraken-a-2gpu kraken-a-4gpu leviathan-a leviathan-a-2gpu leviathan-a-4gpu"
/bin/echo "`date` In epilog " >> /tmp/sout.log
/bin/echo "`date` require node $SLURM_JOB_NODELIST " >> /tmp/sout.log
for N in $NODES; do
  if [ "$N" != "$SLURM_JOB_NODELIST" ]; then
    /bin/echo "`date` undraining $N" >> /tmp/sout.log
    /bin/scontrol update NodeName=$N State=IDLE
  fi
done
```
There is some potential to optimize this configuration: once the current node is configured, the compatible alias on the other physical machine also has a usable configuration and does not have to be taken out of use. A sketch of this refinement is shown below.
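The following is a hypothetical sketch of such an optimized prolog, assuming the pairing of aliases listed in Sect. 3.6; the epilog would need a matching change, and the Reason text is illustrative.

```
#!/bin/bash
# Hypothetical optimized prolog: drain every alias except the requested node
# and its compatible peer on the other machine (the valid pairs from Sect. 3.6).
declare -A PEER=(
  [kraken-a]=leviathan-a-4gpu      [leviathan-a-4gpu]=kraken-a
  [kraken-a-4gpu]=leviathan-a      [leviathan-a]=kraken-a-4gpu
  [kraken-a-2gpu]=leviathan-a-2gpu [leviathan-a-2gpu]=kraken-a-2gpu
)
NODES="kraken-a kraken-a-2gpu kraken-a-4gpu leviathan-a leviathan-a-2gpu leviathan-a-4gpu"
KEEP="$SLURM_JOB_NODELIST ${PEER[$SLURM_JOB_NODELIST]}"
for N in $NODES; do
  case " $KEEP " in
    *" $N "*) : ;;   # requested node or its compatible peer: leave it in service
    *) /bin/scontrol update NodeName=$N State=DRAIN Reason="alias in use" ;;
  esac
done
```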
3.6 Summary of Job Flow

The diagram in Fig. 2 shows the complete sbatch job flow for both the director node and a worker node. The valid nodes and queues are shown in Table 1.
Queue | Machines | GPUs
---|---|---
normal | kraken-a, leviathan-a | 0
2gpu | kraken-a-2gpu, leviathan-a-2gpu | 2
4gpu | kraken-a-4gpu, leviathan-a-4gpu | 4

Table 1. Slurm queues and their alias nodes
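To make this mapping concrete, the node and partition declarations in slurm.conf might look roughly like the sketch below; only the queue names, node names, and GPU counts come from Table 1, and the remaining attributes (CPUs, memory, gres.conf entries, power-saving state) would still need to be filled in for a real cluster.

```
# slurm.conf (excerpt) - illustrative node and partition declarations
NodeName=kraken-a-2gpu,leviathan-a-2gpu Gres=gpu:2
NodeName=kraken-a-4gpu,leviathan-a-4gpu Gres=gpu:4
# CPU-only aliases, no GPUs
NodeName=kraken-a,leviathan-a
PartitionName=normal Nodes=kraken-a,leviathan-a
PartitionName=2gpu   Nodes=kraken-a-2gpu,leviathan-a-2gpu
PartitionName=4gpu   Nodes=kraken-a-4gpu,leviathan-a-4gpu
```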
The possible combinations are (two nodes at any one time):
- kraken-a, leviathan-a-4gpu
- kraken-a-4gpu, leviathan-a
- kraken-a-2gpu, leviathan-a-2gpu
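Given the queues in Table 1, a user simply submits to the queue matching the desired GPU count, and Slurm resumes, configures, and powers on the appropriate alias node. For example (the job script name is hypothetical):

```
# request one node with all four composed GPUs via the 4gpu queue
sbatch --partition=4gpu --nodes=1 --gres=gpu:4 train_model.sh
```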