4 Conclusions

The tests and integration concepts presented here are based on a "first look" at the hardware. More investigation is needed with additional hardware. It is, however, possible to draw some initial conclusions.
  1. Composing systems for HPC seems to work. In these tests, adding GPU-based resources did not appear to cause a loss of performance.
  2. Integration with existing resource schedulers (e.g., Slurm) seems possible; however, more work is needed to create a production-ready environment. This "masquerade" approach lets users think about machines, not configuration, when running jobs.
  3. In terms of using a scheduler to configure the PCIe fabric, more investigation into safe switch reconfiguration is needed. In particular, applying a new PCIe configuration must not change any other node's PCIe configuration while that node is running. This PoC did not address this issue.
  4. While rebooting servers does work, server boot times can be inconveniently long. In addition, some sites prefer not to reboot servers unless absolutely necessary. This may limit some of the methods explored here. Once a rapid and standard PCIe bus rescan is available, it is expected to remove the need to reboot systems and to make scripts such as the Slurm suspend and resume scripts much more efficient.
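
The safety condition in item 3 can be illustrated with a minimal Python sketch. This is not part of the PoC. The function name `unsafe_changes`, the node and device names, and the representation of a fabric configuration as a mapping from node name to attached devices are all hypothetical, chosen only to show the kind of check a scheduler might run before reconfiguring the switch.

```python
# Hypothetical sketch: a fabric configuration is modeled as a dict that maps
# a node name to the list of PCIe devices attached to it. None of these names
# come from the PoC or from any vendor API.

def unsafe_changes(current, proposed, busy_nodes):
    """Return the busy nodes whose attached devices would change.

    An empty result means the proposed configuration only touches nodes
    that are not running, so it could be applied without disturbing them.
    """
    affected = []
    for node in sorted(busy_nodes):
        before = set(current.get(node, ()))
        after = set(proposed.get(node, ()))
        if before != after:
            affected.append(node)
    return affected


current = {"node1": ["gpu0", "gpu1"], "node2": ["gpu2"], "node3": []}
proposed = {"node1": ["gpu0", "gpu1"], "node2": [], "node3": ["gpu2"]}

# Moving gpu2 from node2 to node3 is safe only if node2 is not running.
print(unsafe_changes(current, proposed, busy_nodes={"node1"}))
print(unsafe_changes(current, proposed, busy_nodes={"node1", "node2"}))
```

In this sketch a scheduler would refuse, or defer, any reconfiguration for which the returned list is not empty. A real implementation would also need to obtain the current fabric state and the set of running nodes from the switch and the scheduler, which is outside the scope of this example.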

