diff --git a/contribs/README b/contribs/README
index d9ead0d45690ed968dce5d15fd22175ccc5a9491..0727620ab2e83e43a13332a0e2e3f78b7136686f 100644
--- a/contribs/README
+++ b/contribs/README
@@ -96,6 +96,7 @@ of the SLURM contribs distribution follows:
   sgi/               [Tools for use on SGI systems]
      netloc_to_topology.c [ C program ]
        Used to construct a Slurm topology.conf file based upon SGI network APIs.
+     README.txt           [Documentation]
 
   sjobexit/          [ Perl programs ]
      Tools for managing job exit code records
diff --git a/contribs/sgi/Makefile.am b/contribs/sgi/Makefile.am
index 1db6f890314cc33a199f9ef71e099d5190971f7b..28c06b8bae7c439465246dc7ff39723fd11b2490 100644
--- a/contribs/sgi/Makefile.am
+++ b/contribs/sgi/Makefile.am
@@ -5,4 +5,5 @@
 AUTOMAKE_OPTIONS = foreign
 
 EXTRA_DIST = \
-	netloc_to_topology.c
+	netloc_to_topology.c \
+	README.txt
diff --git a/contribs/sgi/Makefile.in b/contribs/sgi/Makefile.in
index 34b32dcc607a4b965de9bb3e1588d0cac810d09e..08dde19bd7ef9c43fe0f740483a9b3d5fea0374b 100644
--- a/contribs/sgi/Makefile.in
+++ b/contribs/sgi/Makefile.in
@@ -393,7 +393,8 @@ top_builddir = @top_builddir@
 top_srcdir = @top_srcdir@
 AUTOMAKE_OPTIONS = foreign
 EXTRA_DIST = \
-	netloc_to_topology.c
+	netloc_to_topology.c \
+	README.txt
 
 all: all-am
 
diff --git a/contribs/sgi/README.txt b/contribs/sgi/README.txt
new file mode 100644
index 0000000000000000000000000000000000000000..dbb83e0a805a6bc8f3d8069e2ee27f78937cca54
--- /dev/null
+++ b/contribs/sgi/README.txt
@@ -0,0 +1,56 @@
+Copyright (C) 2014 Silicon Graphics International Corp.
+All rights reserved.
+
+The SGI hypercube topology plugin for SLURM enables SLURM to understand the
+hypercube topologies on some SGI ICE InfiniBand clusters. With this
+understanding about where nodes are physically located in relation to each
+other, SLURM can make better decisions about which sets of nodes to allocate to
+jobs.
+
+The plugin requires a properly set up topology.conf file. This is built using
+the contribs/sgi/netloc_to_topology program which in turn uses the OpenMPI
+group's netloc and hwloc tools. Please execute the following steps:
+
+1) Ensure that hwloc and netloc are installed on every node in your cluster
+
+2) Create a temporary directory in a shared filesystem available to each node
+   in your cluster. In this example we'll call it /data/slurm/cluster_data/.
+
+3) Create a subdirectory called hwloc, i.e. /data/slurm/cluster_data/hwloc/.
+
+4) Create the following script in /data/slurm/cluster_data/create.sh
+      #!/bin/sh
+      HN=`hostname`
+      hwloc-ls /data/slurm/cluster_data/hwloc/$HN.xml
+
+5) Run the script on each compute node
+      $ cexec /data/slurm/cluster_data/create.sh
+
+6) Ensure that hwloc output files got put into /data/slurm/cluster_data/hwloc/.
+   If you have any nodes down right now, their missing data may cause you
+   problems later.
+
+7) Run netloc discovery on the primary InfiniBand fabric
+      $ cd /data/slurm/cluster_data/
+      $ netloc_ib_gather_raw --out-dir ib-raw --sudo --force-subnet mlx4_0:1
+      $ netloc_ib_extract_dats
+
+8) Run netloc_to_topology to turn the netloc and hwloc data into a SLURM
+   topology.conf.
+      $ netloc_to_topology -d /data/slurm/cluster_data/
+   netloc_to_topology assumes an InfiniBand fabric ID of "fe80:0000:0000:0000".
+   If you have a different fabric ID, then you'll need to specify it with the
+   "-f" option. You can find the fabric ID with `ibv_devinfo -v`. E.g.
+      $ ibv_devinfo -v
+   Look down the results and for the HCA and port that you want to key off of,
+   look at its GID field. E.g.
+      GID[ 0]: fec0:0000:0000:0000:f452:1403:0047:36d1
+   Use the first four colon-separated groups:
+      $ netloc_to_topology -d /data/slurm/cluster_data/ -f fec0:0000:0000:0000
+
+9) Copy the resulting topology.conf file into SLURM's location for configuration
+   files. The following command copies it to the compute nodes. Make sure to
+   copy it to the node(s) running slurmctld as well.
+      $ cpush topology.conf /etc/slurm/topology.conf
+
+10) Restart SLURM
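
Reviewer note: steps 6 and 8 of the README added by this patch can be checked with a small script. The following is a minimal sketch only, not part of the patch. The helper names missing_hwloc_nodes and fabric_id_from_gid are hypothetical, and the sketch assumes that the name printed by `hostname` on each node is the same name you pass in (a node that reports a fully qualified name would need that name instead).

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not part of SLURM, hwloc, or netloc.

# Step 6: print every node name (arguments 2..n) that has no non-empty
# <dir>/<node>.xml file, i.e. nodes whose hwloc data is missing.
missing_hwloc_nodes() {
    dir=$1
    shift
    for node in "$@"; do
        [ -s "$dir/$node.xml" ] || printf '%s\n' "$node"
    done
}

# Step 8: print the first four colon-separated groups of a GID, which is
# the value the README passes to netloc_to_topology with -f.
fabric_id_from_gid() {
    printf '%s\n' "$1" | cut -d: -f1-4
}
```

For example, `fabric_id_from_gid fec0:0000:0000:0000:f452:1403:0047:36d1` prints `fec0:0000:0000:0000`, matching the README's example, and `missing_hwloc_nodes /data/slurm/cluster_data/hwloc node1 node2` lists whichever of the two (illustrative) node names still lacks an XML file.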