diff --git a/doc.zih.tu-dresden.de/docs/access/desktop_cloud_visualization.md b/doc.zih.tu-dresden.de/docs/access/desktop_cloud_visualization.md index 7395aad287f5c197ae8ba639491c493e87f2ffe9..e3fe6e8f25e5a59c876454807410c05c2494f8d3 100644 --- a/doc.zih.tu-dresden.de/docs/access/desktop_cloud_visualization.md +++ b/doc.zih.tu-dresden.de/docs/access/desktop_cloud_visualization.md @@ -29,7 +29,7 @@ Click on the `DCV` button. A new tab with the DCV client will be opened. - Check GPU support via: ```console hl_lines="4" -marie@compute$ glxinfo +marie@compute$ glxinfo | head name of display: :1 display: :1 screen: 0 direct rendering: Yes diff --git a/doc.zih.tu-dresden.de/docs/access/jupyterhub_for_teaching.md b/doc.zih.tu-dresden.de/docs/access/jupyterhub_for_teaching.md index d3ab18892984458d53b3b55bbcf5ce70d6592a51..84367bda4f78961c8c4675bb0b883a0c0f9b0e74 100644 --- a/doc.zih.tu-dresden.de/docs/access/jupyterhub_for_teaching.md +++ b/doc.zih.tu-dresden.de/docs/access/jupyterhub_for_teaching.md @@ -73,7 +73,7 @@ The spawn form now offers a quick start mode by passing URL parameters. !!! example The following link would create a jupyter notebook session on the - `interactive` partition with the `test` environment being loaded: + partition `interactive` with the `test` environment being loaded: ``` https://taurus.hrsk.tu-dresden.de/jupyter/hub/spawn#/~(partition~'interactive~environment~'test) diff --git a/doc.zih.tu-dresden.de/docs/contrib/content_rules.md b/doc.zih.tu-dresden.de/docs/contrib/content_rules.md index 5afcf96350ddf28981dc651cadd8381f06b4bc6c..1b1d5b460d78b65f5f8516b827e06e7782480fe8 100644 --- a/doc.zih.tu-dresden.de/docs/contrib/content_rules.md +++ b/doc.zih.tu-dresden.de/docs/contrib/content_rules.md @@ -394,12 +394,12 @@ This should help to avoid errors. | Localhost | `marie@local$` | | Login nodes | `marie@login$` | | Arbitrary compute node | `marie@compute$` | -| `haswell` partition | `marie@haswell$` | -| `ml` partition | `marie@ml$` | -| `alpha` partition | `marie@alpha$` | -| `romeo` partition | `marie@romeo$` | -| `julia` partition | `marie@julia$` | -| `dcv` partition | `marie@dcv$` | +| Partition `haswell` | `marie@haswell$` | +| Partition `ml` | `marie@ml$` | +| Partition `alpha` | `marie@alpha$` | +| Partition `romeo` | `marie@romeo$` | +| Partition `julia` | `marie@julia$` | +| Partition `dcv` | `marie@dcv$` | * **Always use a prompt**, even if there is no output provided for the shown command. * All code blocks which specify some general command templates, e.g. containing `<` and `>` diff --git a/doc.zih.tu-dresden.de/docs/index.md b/doc.zih.tu-dresden.de/docs/index.md index 601c8c31960805b91160aff1af15a2ad18ebbb5f..c2b122e008d51f7826aa784f6adbb727e5c4297a 100644 --- a/doc.zih.tu-dresden.de/docs/index.md +++ b/doc.zih.tu-dresden.de/docs/index.md @@ -25,11 +25,11 @@ Please also find out the other ways you could contribute in our [guidelines how **2022-01-13** [Supercomputing extension for TU Dresden](https://tu-dresden.de/zih/die-einrichtung/news/supercomputing-cluster-2022) -**2021-09-29** Introduction to HPC at ZIH ([HPC introduction slides](misc/HPC-Introduction.pdf)) - ## Training and Courses We offer a rich and colorful bouquet of courses from classical *HPC introduction* to various *Performance Analysis* and *Machine Learning* trainings. Please refer to the page [Training Offers](https://tu-dresden.de/zih/hochleistungsrechnen/nhr-training) for a detailed overview of the courses and the respective dates at ZIH. 
+
+* [HPC introduction slides](misc/HPC-Introduction.pdf) (Sep. 2021)
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md
index dadc94855ecc71a229e0ab19b15b6837f2bbf872..561d0ed622ae6b514866404a38ee5bc7d2f0c4ba 100644
--- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md
@@ -20,7 +20,7 @@ It has 34 nodes, each with:
 ### Modules
 
 The easiest way is using the [module system](../software/modules.md).
-The software for the partition alpha is available in `modenv/hiera` module environment.
+The software for the partition `alpha` is available in module environment `modenv/hiera`.
 
 To check the available modules for `modenv/hiera`, use the command
 
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md
new file mode 100644
index 0000000000000000000000000000000000000000..bcabe289e3390e0ea6915ef33fb05eb1cff97fef
--- /dev/null
+++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/mpi_issues.md
@@ -0,0 +1,44 @@
+# Known MPI-Usage Issues
+
+This page holds known issues observed with MPI and concrete MPI implementations.
+
+## Mpirun on Partitions `alpha` and `ml`
+
+Using `mpirun` on partitions `alpha` and `ml` leads to wrong resource distribution when more than
+one node is involved. This yields a strange distribution like e.g. `SLURM_NTASKS_PER_NODE=15,1`
+even though `--tasks-per-node=8` was specified. Unless you really know what you are doing (e.g.
+rank pinning via a Perl script), avoid using `mpirun`.
+
+Another issue arises when using the Intel toolchain: `mpirun` calls a different MPI and caused
+an 8-9x slowdown in the PALM application in comparison to using `srun` or the GCC-compiled
+version of the application (which uses the correct MPI).
+
+## R Parallel Library on Multiple Nodes
+
+Using the R parallel library on MPI clusters has shown problems when using more than a few compute
+nodes. The error messages indicate that there are buggy interactions of R/Rmpi/OpenMPI and UCX.
+Disabling UCX has solved these problems in our experiments.
+
+We invoked the R script successfully with the following command:
+
+```console
+mpirun -mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx -np 1 Rscript \
+  --vanilla the-script.R
+```
+
+where the arguments `-mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx` disable usage of
+UCX.
+
+## MPI Function `MPI_Win_allocate`
+
+The function `MPI_Win_allocate` is a one-sided MPI call that allocates memory and returns a window
+object for RDMA operations (ref. [man page](https://www.open-mpi.org/doc/v3.0/man3/MPI_Win_allocate.3.php)).
+
+> Using MPI_Win_allocate rather than separate MPI_Alloc_mem + MPI_Win_create may allow the MPI
+> implementation to optimize the memory allocation. (Using Advanced MPI)
+
+It was observed, at least for the `OpenMPI/4.0.5` module, that using `MPI_Win_allocate` instead of
+`MPI_Alloc_mem` in conjunction with `MPI_Win_create` leads to segmentation faults in the calling
+application. To be precise, the segfaults occurred on partition `romeo` when about 200 GB per node
+were allocated. In contrast, the segmentation faults vanished when the implementation was
+refactored to call the `MPI_Alloc_mem + MPI_Win_create` functions.
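+
+A minimal C sketch of this refactoring is shown below. Buffer size and element type are
+placeholders only and do not correspond to the application mentioned above.
+
+```c
+#include <mpi.h>
+
+int main(int argc, char **argv) {
+  MPI_Init(&argc, &argv);
+
+  MPI_Aint size = (MPI_Aint)1 << 30;  /* placeholder: 1 GiB of window memory per process */
+  double *buf;
+  MPI_Win win;
+
+  /* Variant that showed segmentation faults for large windows:
+   * MPI_Win_allocate(size, sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);
+   */
+
+  /* Workaround: allocate the memory first, then create the window from it. */
+  MPI_Alloc_mem(size, MPI_INFO_NULL, &buf);
+  MPI_Win_create(buf, size, sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
+
+  /* ... one-sided communication (MPI_Put/MPI_Get) on the window ... */
+
+  MPI_Win_free(&win);
+  MPI_Free_mem(buf);
+  MPI_Finalize();
+  return 0;
+}
+```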
diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md index 5da3c69156a571057d9aa9dbf085b2d87e1c2063..7043d46041314aa175bdfb401fb3129e529903c2 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/slurm.md @@ -1,33 +1,15 @@ # Batch System Slurm +ZIH uses the batch system Slurm for resource management and job scheduling. Compute nodes are not +accessed directly, but addressed through Slurm. You specify the needed resources +(cores, memory, GPU, time, ...) and Slurm will schedule your job for execution. + When logging in to ZIH systems, you are placed on a login node. There, you can manage your [data life cycle](../data_lifecycle/overview.md), setup experiments, and edit and prepare jobs. The login nodes are not suited for computational work! From the login nodes, you can interact with the batch system, e.g., submit and monitor your jobs. -A typical workflow would look like this: - -```mermaid -sequenceDiagram - user ->>+ login node: run programm - login node ->> login node: kill after 5 min - login node ->>- user: Killed! - user ->> login node: salloc [...] - login node ->> Slurm: Request resources - Slurm ->> user: resources - user ->>+ allocated resources: srun [options] [command] - allocated resources ->> allocated resources: run command (on allocated nodes) - allocated resources ->>- user: program finished - user ->>+ allocated resources: srun [options] [further_command] - allocated resources ->> allocated resources: run further command - allocated resources ->>- user: program finished - user ->>+ allocated resources: srun [options] [further_command] - allocated resources ->> allocated resources: run further command - Slurm ->> allocated resources: Job limit reached/exceeded - allocated resources ->>- user: Job limit reached -``` - ??? note "Batch System" The batch system is the central organ of every HPC system users interact with its compute @@ -36,6 +18,28 @@ sequenceDiagram for your job, the batch system allocates and connects to these resources, transfers runtime environment, and starts the job. + A workflow could look like this: + + ```mermaid + sequenceDiagram + user ->>+ login node: run programm + login node ->> login node: kill after 5 min + login node ->>- user: Killed! + user ->> login node: salloc [...] + login node ->> Slurm: Request resources + Slurm ->> user: resources + user ->>+ allocated resources: srun [options] [command] + allocated resources ->> allocated resources: run command (on allocated nodes) + allocated resources ->>- user: program finished + user ->>+ allocated resources: srun [options] [further_command] + allocated resources ->> allocated resources: run further command + allocated resources ->>- user: program finished + user ->>+ allocated resources: srun [options] [further_command] + allocated resources ->> allocated resources: run further command + Slurm ->> allocated resources: Job limit reached/exceeded + allocated resources ->>- user: Job limit reached + ``` + ??? note "Batch Job" At HPC systems, computational work and resource requirements are encapsulated into so-called @@ -50,10 +54,6 @@ sequenceDiagram Moreover, the [runtime environment](../software/overview.md) as well as the executable and certain command-line arguments have to be specified to run the computational work. -ZIH uses the batch system Slurm for resource management and job scheduling. 
-Just specify the resources you need in terms
-of cores, memory, and time and your Slurm will place your job on the system.
-
 This page provides a brief overview on
 
 * [Slurm options](#options) to specify resource requirements,
@@ -74,31 +74,43 @@ information:
 
 There are three basic Slurm commands for job submission and execution:
 
-1. `srun`: Submit a job for execution or initiate job steps in real time.
+1. `srun`: Run a parallel application (and, if necessary, allocate resources first).
 1. `sbatch`: Submit a batch script to Slurm for later execution.
-1. `salloc`: Obtain a Slurm job allocation (a set of nodes), execute a command, and then release the
-   allocation when the command is finished.
+1. `salloc`: Obtain a Slurm job allocation (i.e., resources like CPUs, nodes and GPUs) for
+   interactive use. Release the allocation when finished.
 
 Using `srun` directly on the shell will be blocking and launch an
 [interactive job](#interactive-jobs). Apart from short test runs, it is recommended to submit your
 jobs to Slurm for later execution by using [batch jobs](#batch-jobs). For that, you can conveniently
-put the parameters directly in a [job file](#job-files), which you can submit using `sbatch
+put the parameters in a [job file](#job-files), which you can submit using `sbatch
 [options] <job file>`.
 
-At runtime, the environment variable `SLURM_JOB_ID` is set to the id of your job. The job
-id is unique. The id allows you to [manage and control](#manage-and-control-jobs) your jobs.
+After submission, your job gets a unique job ID, which is stored in the environment variable
+`SLURM_JOB_ID` at job runtime. The command `sbatch` prints the job ID right after submission.
+Furthermore, you can find it via `squeue --me`. The job ID allows you to
+[manage and control](#manage-and-control-jobs) your jobs.
+
+!!! warning "srun vs. mpirun"
+
+    On ZIH systems, `srun` is used to run your parallel application. The use of `mpirun` is known
+    to be broken on partitions `ml` and `alpha` for jobs requiring more than one node. Especially
+    when using code from GitHub projects, double-check its configuration by looking for a line
+    like `submit command mpirun -n $ranks ./app` and replace it with `srun ./app`.
+
+    Otherwise, this may lead to wrong resource distribution and thus job failure, or tremendous
+    slowdowns of your application.
 
 ## Options
 
-The following table contains the most important options for `srun/sbatch/salloc` to specify resource
-requirements and control communication.
+The following table contains the most important options for `srun`, `sbatch`, `salloc` to specify
+resource requirements and control communication.
 
-??? tip "Options Table"
+??? tip "Options Table (see `man sbatch`)"
 
     | Slurm Option               | Description |
    |:---------------------------|:------------|
-    | `-n, --ntasks=<N>`         | Number of (MPI) tasks (default: 1) |
-    | `-N, --nodes=<N>`          | Number of nodes; there will be `--ntasks-per-node` processes started on each node |
+    | `-n, --ntasks=<N>`         | Total number of (MPI) tasks (default: 1) |
+    | `-N, --nodes=<N>`          | Number of compute nodes |
    | `--ntasks-per-node=<N>`    | Number of tasks per allocated node to start (default: 1) |
    | `-c, --cpus-per-task=<N>`  | Number of CPUs per task; needed for multithreaded (e.g. OpenMP) jobs; typically `N` should be equal to `OMP_NUM_THREADS` |
    | `-p, --partition=<name>`   | Type of nodes where you want to execute your job (refer to [partitions](partitions_and_limits.md)) |
@@ -115,6 +127,7 @@
| `-a, --array=<arg>` | Submit an array job ([examples](slurm_examples.md#array-jobs)) | | `-w <node1>,<node2>,...` | Restrict job to run on specific nodes only | | `-x <node1>,<node2>,...` | Exclude specific nodes from job | + | `--test-only` | Retrieve estimated start time of a job considering the job queue; does not actually submit the job nor run the application | !!! note "Output and Error Files" @@ -140,11 +153,11 @@ Interactive activities like editing, compiling, preparing experiments etc. are n the login nodes. For longer interactive sessions, you can allocate cores on the compute node with the command `salloc`. It takes the same options as `sbatch` to specify the required resources. -`salloc` returns a new shell on the node, where you submitted the job. You need to use the command +`salloc` returns a new shell on the node where you submitted the job. You need to use the command `srun` in front of the following commands to have these commands executed on the allocated resources. If you allocate more than one task, please be aware that `srun` will run the command on each allocated task by default! To release the allocated resources, invoke the command `exit` or -`scancel <jobid`. +`scancel <jobid>`. ```console marie@login$ salloc --nodes=2 @@ -188,22 +201,22 @@ taurusi6604.taurus.hrsk.tu-dresden.de The [module commands](../software/modules.md) are made available by sourcing the files `/etc/profile` and `~/.bashrc`. This is done automatically by passing the parameter `-l` to your shell, as shown in the example above. If you missed adding `-l` at submitting the interactive - session, no worry, you can source this files also later on manually. + session, no worry, you can source this files also later on manually (`source /etc/profile`). !!! note "Partition `interactive`" - A dedicated partition `interactive` is reserved for short jobs (< 8h) with not more than one job - per user. Please check the availability of nodes there with `sinfo --partition=interactive`. + A dedicated partition `interactive` is reserved for short jobs (< 8h) with no more than one job + per user. An interactive partition is available for every regular partition, e.g. + `alpha-interactive` for `alpha`. Please check the availability of nodes there with + `sinfo |grep 'interactive\|AVAIL' |less` ### Interactive X11/GUI Jobs Slurm will forward your X11 credentials to the first (or even all) node for a job with the -(undocumented) `--x11` option. For example, an interactive session for one hour with Matlab using -eight cores can be started with: +(undocumented) `--x11` option. ```console -marie@login$ module load MATLAB -marie@login$ srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first matlab +marie@login$ srun --ntasks=1 --pty --x11=first xeyes ``` !!! hint "X11 error" @@ -222,18 +235,18 @@ marie@login$ srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first ## Batch Jobs Working interactively using `srun` and `salloc` is a good starting point for testing and compiling. -But, as soon as you leave the testing stage, we highly recommend you to use batch jobs. +But, as soon as you leave the testing stage, we highly recommend to use batch jobs. Batch jobs are encapsulated within [job files](#job-files) and submitted to the batch system using `sbatch` for later execution. A job file is basically a script holding the resource requirements, environment settings and the commands for executing the application. 
Using batch jobs and job files -has multiple advantages: +has multiple advantages*: * You can reproduce your experiments and work, because all steps are saved in a file. * You can easily share your settings and experimental setup with colleagues. -* You can submit your job file to the scheduling system for later execution. In the meanwhile, you can - grab a coffee and proceed with other work (e.g., start writing a paper). -!!! hint "The syntax for submitting a job file to Slurm is" +*) If job files are version controlled or environment `env` is saved along with Slurm output. + +!!! hint "Syntax: Submitting a batch job" ```console marie@login$ sbatch [options] <job_file> @@ -247,15 +260,15 @@ Job files have to be written with the following structure. #!/bin/bash # ^Batch script starts with shebang line -#SBATCH --ntasks=24 # All #SBATCH lines have to follow uninterrupted -#SBATCH --time=01:00:00 # after the shebang line -#SBATCH --account=<KTR> # Comments start with # and do not count as interruptions -#SBATCH --job-name=fancyExp -#SBATCH --output=simulation-%j.out -#SBATCH --error=simulation-%j.err +#SBATCH --ntasks=24 # #SBATCH lines request resources and +#SBATCH --time=01:00:00 # specify Slurm options +#SBATCH --account=<KTR> # +#SBATCH --job-name=fancyExp # All #SBATCH lines have to follow uninterrupted +#SBATCH --output=simulation-%j.out # after the shebang line +#SBATCH --error=simulation-%j.err # Comments start with # and do not count as interruptions -module purge # Set up environment, e.g., clean modules environment -module load <modules> # and load necessary modules +module purge # Set up environment, e.g., clean/switch modules environment +module load <module1 module2> # and load necessary modules srun ./application [options] # Execute parallel application with srun ``` @@ -361,9 +374,10 @@ On the command line, use `squeue` to watch the scheduling queue. Invoke `squeue --me` to list only your jobs. -The command `squeue` will tell the reason, why a job is not running (job status in the last column -of the output). More information about job parameters can also be determined with `scontrol -d show -job <jobid>`. The following table holds detailed descriptions of the possible job states: +In it's last column, the `squeue` command will also tell why a job is not running. +Possible reasons and their detailed descriptions are listed in the following table. +More information about job parameters can be obtained with `scontrol -d show +job <jobid>`. ??? tip "Reason Table" @@ -384,8 +398,6 @@ job <jobid>`. The following table holds detailed descriptions of the possible jo | `TimeLimit` | The job exhausted its time limit. | | `InactiveLimit` | The job reached the system inactive limit. | -In addition, the `sinfo` command gives you a quick status overview. - For detailed information on why your submitted job has not started yet, you can use the command ```console @@ -394,7 +406,7 @@ marie@login$ whypending <jobid> ### Editing Jobs -Jobs that have not yet started can be altered. Using `scontrol update timelimit=4:00:00 +Jobs that have not yet started can be altered. By using `scontrol update timelimit=4:00:00 jobid=<jobid>`, it is for example possible to modify the maximum runtime. `scontrol` understands many different options, please take a look at the [scontrol documentation](https://slurm.schedmd.com/scontrol.html) for more details. @@ -404,62 +416,70 @@ many different options, please take a look at the The command `scancel <jobid>` kills a single job and removes it from the queue. 
By using `scancel -u <username>`, you can send a canceling signal to all of your jobs at once. -### Accounting +### Evaluating Jobs The Slurm command `sacct` provides job statistics like memory usage, CPU time, energy usage etc. +as table-formatted output on the command line. + +The job monitor [PIKA](../software/pika.md) provides web-based graphical performance statistics +at no extra cost. !!! hint "Learn from old jobs" - We highly encourage you to use `sacct` to learn from you previous jobs in order to better + We highly encourage you to inspect your previous jobs in order to better estimate the requirements, e.g., runtime, for future jobs. + With PIKA, it is e.g. easy to check whether a job is hanging, idling, + or making good use of the resources. -`sacct` outputs the following fields by default. +??? tip "Using sacct (see also `man sacct`)" + `sacct` outputs the following fields by default. -```console -# show all own jobs contained in the accounting database -marie@login$ sacct - JobID JobName Partition Account AllocCPUS State ExitCode ------------- ---------- ---------- ---------- ---------- ---------- -------- -[...] -``` + ```console + # show all own jobs contained in the accounting database + marie@login$ sacct + JobID JobName Partition Account AllocCPUS State ExitCode + ------------ ---------- ---------- ---------- ---------- ---------- -------- + [...] + ``` -We'd like to point your attention to the following options to gain insight in your jobs. + We'd like to point your attention to the following options to gain insight in your jobs. -??? example "Show specific job" + ??? example "Show specific job" - ```console - marie@login$ sacct --jobs=<JOBID> - ``` + ```console + marie@login$ sacct --jobs=<JOBID> + ``` -??? example "Show all fields for a specific job" + ??? example "Show all fields for a specific job" - ```console - marie@login$ sacct --jobs=<JOBID> --format=All - ``` + ```console + marie@login$ sacct --jobs=<JOBID> --format=All + ``` -??? example "Show specific fields" + ??? example "Show specific fields" - ```console - marie@login$ sacct --jobs=<JOBID> --format=JobName,MaxRSS,MaxVMSize,CPUTime,ConsumedEnergy - ``` + ```console + marie@login$ sacct --jobs=<JOBID> --format=JobName,MaxRSS,MaxVMSize,CPUTime,ConsumedEnergy + ``` -The manual page (`man sacct`) and the [sacct online reference](https://slurm.schedmd.com/sacct.html) -provide a comprehensive documentation regarding available fields and formats. + The manual page (`man sacct`) and the [sacct online reference](https://slurm.schedmd.com/sacct.html) + provide a comprehensive documentation regarding available fields and formats. -!!! hint "Time span" + !!! hint "Time span" - By default, `sacct` only shows data of the last day. If you want to look further into the past - without specifying an explicit job id, you need to provide a start date via the option - `--starttime` (or short: `-S`). A certain end date is also possible via `--endtime` (or `-E`). + By default, `sacct` only shows data of the last day. If you want to look further into the past + without specifying an explicit job id, you need to provide a start date via the option + `--starttime` (or short: `-S`). A certain end date is also possible via `--endtime` (or `-E`). -??? example "Show all jobs since the beginning of year 2021" + ??? 
example "Show all jobs since the beginning of year 2021" - ```console - marie@login$ sacct -S 2021-01-01 [-E now] - ``` + ```console + marie@login$ sacct -S 2021-01-01 [-E now] + ``` ## Jobs at Reservations +Within a reservation, you have privileged access to HPC resources. How to ask for a reservation is described in the section [reservations](overview.md#exclusive-reservation-of-hardware). After we agreed with your requirements, we will send you an e-mail with your reservation name. Then, @@ -471,7 +491,7 @@ marie@login$ scontrol show res=<reservation name> ``` If you want to use your reservation, you have to add the parameter -`--reservation=<reservation name>` either in your sbatch script or to your `srun` or `salloc` command. +`--reservation=<reservation name>` either in your job script or to your `srun` or `salloc` command. ## Node Features for Selective Job Submission diff --git a/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md b/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md index 4bd9634db24b8ba81a02368a4f51c0b46004885f..47c7567b1a063a4b67cca2982d53bf729b288295 100644 --- a/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md +++ b/doc.zih.tu-dresden.de/docs/software/big_data_frameworks.md @@ -41,7 +41,7 @@ The Spark and Flink modules are available in both `scs5` and `ml` environments. Thus, Spark and Flink can be executed using different CPU architectures, e.g., Haswell and Power9. Let us assume that two nodes should be used for the computation. Use a `srun` command similar to -the following to start an interactive session using the partition haswell. The following code +the following to start an interactive session using the partition `haswell`. The following code snippet shows a job submission to haswell nodes with an allocation of two nodes with 60000 MB main memory exclusively for one hour: diff --git a/doc.zih.tu-dresden.de/docs/software/cfd.md b/doc.zih.tu-dresden.de/docs/software/cfd.md index b9f1556c18e4e7e57685cb523c459435154cbabe..7cb4b2521eaddee6e7997b2fab109e31a7e4c5bf 100644 --- a/doc.zih.tu-dresden.de/docs/software/cfd.md +++ b/doc.zih.tu-dresden.de/docs/software/cfd.md @@ -88,7 +88,7 @@ To use fluent interactively, please try: ```console marie@login$ module load ANSYS/19.2 marie@login$ srun --nodes=1 --cpus-per-task=4 --time=1:00:00 --pty --x11=first bash -marie@login$ fluent & +marie@compute$ fluent & ``` ## STAR-CCM+ diff --git a/doc.zih.tu-dresden.de/docs/software/custom_easy_build_environment.md b/doc.zih.tu-dresden.de/docs/software/custom_easy_build_environment.md index e9283d6d8063bbc9dc6d4c2bd520d9dc96f341b1..9232e7472e8acc0254f876352310be0355d9aa4e 100644 --- a/doc.zih.tu-dresden.de/docs/software/custom_easy_build_environment.md +++ b/doc.zih.tu-dresden.de/docs/software/custom_easy_build_environment.md @@ -61,7 +61,7 @@ marie@login$ ws_list | grep 'directory.*EasyBuild' put commands in a batch file and source it. The latter is recommended for non-interactive jobs, using the command `sbatch` instead of `srun`. For the sake of illustration, we use an interactive job as an example. Depending on the partitions that you want the module to be usable on -later, you need to select nodes with the same architecture. Thus, use nodes from partition ml for +later, you need to select nodes with the same architecture. Thus, use nodes from partition `ml` for building, if you want to use the module on nodes of that partition. In this example, we assume that we want to use the module on nodes with x86 architecture and thus, we use Haswell nodes. 
@@ -80,14 +80,14 @@ environment variable called `WORKSPACE` with the path to your workspace: marie@compute$ export WORKSPACE=/scratch/ws/1/marie-EasyBuild #see output of ws_list above ``` -**Step 4:** Load the correct module environment `modenv` according to your current or target +**Step 4:** Load the correct module environment `modenv` according to your current or target architecture: -=== "x86 (default, e. g. partition haswell)" +=== "x86 (default, e. g. partition `haswell`)" ```console marie@compute$ module load modenv/scs5 ``` -=== "Power9 (partition ml)" +=== "Power9 (partition `ml`)" ```console marie@ml$ module load modenv/ml ``` diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md index def2bde95ab679cc6c41c222d80d48534a2301bf..38d198969801a913287d92ffc300b0447bfacddb 100644 --- a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md +++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_python.md @@ -85,7 +85,7 @@ pandarallel module. If the pandarallel module is not installed already, use a df.parallel_apply(func=transform, axis=1) ``` For more examples of using pandarallel check out -[https://github.com/nalepae/pandarallel/blob/master/docs/examples.ipynb](https://github.com/nalepae/pandarallel/blob/master/docs/examples.ipynb). +[https://github.com/nalepae/pandarallel/blob/master/docs/examples.ipynb](https://github.com/nalepae/pandarallel/blob/master/docs/examples_mac_linux.ipynb). ### Dask diff --git a/doc.zih.tu-dresden.de/docs/software/debuggers.md b/doc.zih.tu-dresden.de/docs/software/debuggers.md index d57ceab704a534302ff24407e2c20bdce3dbd833..c96a38f99b06100965783757a8c33e6bcc79c65a 100644 --- a/doc.zih.tu-dresden.de/docs/software/debuggers.md +++ b/doc.zih.tu-dresden.de/docs/software/debuggers.md @@ -79,9 +79,9 @@ modified by DDT available, which has better support for Fortran 90 (e.g. derive - The more processes and nodes involved, the higher is the probability for timeouts or other problems - Debug with as few processes as required to reproduce the bug you want to find -- Module to load before using: `module load ddt` Start: `ddt <executable>` If the GUI runs too slow -- over your remote connection: - Use [WebVNC](../access/graphical_applications_with_webvnc.md) to start a remote desktop session in +- Module to load before using: `module load ddt` Start: `ddt <executable>` + - If the GUI runs too slow over your remote connection: Use +[WebVNC](../access/graphical_applications_with_webvnc.md) to start a remote desktop session in a web browser. - Slides from user training: [Parallel Debugging with DDT](misc/debugging_ddt.pdf) diff --git a/doc.zih.tu-dresden.de/docs/software/energy_measurement.md b/doc.zih.tu-dresden.de/docs/software/energy_measurement.md index 2f47043ecdd7021467b8b364731ba0cdbb154b1b..ac73235a27fefc8ea6dffb934b4439391a32cfff 100644 --- a/doc.zih.tu-dresden.de/docs/software/energy_measurement.md +++ b/doc.zih.tu-dresden.de/docs/software/energy_measurement.md @@ -103,12 +103,6 @@ functions with the component power consumption of the parallel application.  {: align="center"} -!!! note - - The power measurement modules `scorep-dataheap` and `scorep-hdeem` are dynamic and only - need to be loaded during execution. However, `scorep-hdeem` does require the application to - be linked with a certain version of Score-P. - By default, `scorep-dataheap` records all sensors that are available. Currently this is the total node consumption and the CPUs. 
`scorep-hdeem` also records all available sensors (node, 2x CPU, 4x DDR) by default. You can change the selected sensors by setting the environment diff --git a/doc.zih.tu-dresden.de/docs/software/fem_software.md b/doc.zih.tu-dresden.de/docs/software/fem_software.md index af6b9fb80986e2bc727ae88e97b2cca614ffd629..a225e0d61fbd97c699482be72e53ea36888da67c 100644 --- a/doc.zih.tu-dresden.de/docs/software/fem_software.md +++ b/doc.zih.tu-dresden.de/docs/software/fem_software.md @@ -96,11 +96,11 @@ start a Ansys workbench on the login nodes interactively for short tasks. The se Since the MPI library that Ansys uses internally (Platform MPI) has some problems integrating seamlessly with Slurm, you have to unset the enviroment variable `SLURM_GTIDS` in your - environment bevor running Ansysy workbench in interactive andbatch mode. + environment befor running Ansysy workbench in interactive and batch mode. ### Using Workbench Interactively -Ansys workbench (`runwb2`) an be invoked interactively on the login nodes of ZIH systems for short tasks. +Ansys workbench (`runwb2`) can be invoked interactively on the login nodes of ZIH systems for short tasks. [X11 forwarding](../access/ssh_login.md#x11-forwarding) needs to enabled when establishing the SSH connection. For OpenSSH the corresponding option is `-X` and it is valuable to use compression of all data via `-C`. @@ -122,7 +122,7 @@ marie@login$ # e.g. marie@login$ module load ANSYS/2020R2 marie@login$ srun --time=00:30:00 --x11=first [SLURM_OPTIONS] --pty bash [...] -marie@login$ runwb2 +marie@compute$ runwb2 ``` !!! hint "Better use DCV" @@ -161,7 +161,7 @@ parameter (for batch mode), `-F` for your project file, and can then either add # module load ANSYS/<version> # e.g. - module load ANSYS ANSYS/2020R2 + module load ANSYS/2020R2 runwb2 -B -F Workbench_Taurus.wbpj -E 'Project.Update' -E 'Save(Overwrite=True)' #or, if you wish to use a workbench replay file, replace the -E parameters with: -R mysteps.wbjn @@ -268,7 +268,7 @@ You need a job file (aka. batch script) to run the MPI version. #SBATCH --ntasks=16 # number of processor cores (i.e. tasks) #SBATCH --mem-per-cpu=1900M # memory per CPU core - module load ls-dyna + module load LS-DYNA srun mpp-dyna i=neon_refined01_30ms.k memory=120000000 ``` diff --git a/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md b/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md index 688ada0e2aabf973f545d54b1c15168de98aa912..885e617f3f47797acd8858e18c363807e77bde67 100644 --- a/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md +++ b/doc.zih.tu-dresden.de/docs/software/hyperparameter_optimization.md @@ -190,8 +190,8 @@ There are the following script preparation steps for OmniOpt: ``` 1. Testing script functionality and determine software requirements for the chosen - [partition](../jobs_and_resources/partitions_and_limits.md). In the following, the alpha - partition is used. Please note the parameters `--out-layer1`, `--batchsize`, `--epochs` when + [partition](../jobs_and_resources/partitions_and_limits.md). In the following, the + partition `alpha` is used. Please note the parameters `--out-layer1`, `--batchsize`, `--epochs` when calling the Python script. Additionally, note the `RESULT` string with the output for OmniOpt. ??? 
hint "Hint for installing Python modules" diff --git a/doc.zih.tu-dresden.de/docs/software/machine_learning.md b/doc.zih.tu-dresden.de/docs/software/machine_learning.md index e293b007a9c07fbaf41ba3ec7ce25f29024f44d7..825c83ba2a4fd0993a9b771d4d82758e9b0de20b 100644 --- a/doc.zih.tu-dresden.de/docs/software/machine_learning.md +++ b/doc.zih.tu-dresden.de/docs/software/machine_learning.md @@ -5,25 +5,25 @@ For machine learning purposes, we recommend to use the partitions `alpha` and/or ## Partition `ml` -The compute nodes of the partition ML are built on the base of +The compute nodes of the partition `ml` are built on the base of [Power9 architecture](https://www.ibm.com/it-infrastructure/power/power9) from IBM. The system was created for AI challenges, analytics and working with data-intensive workloads and accelerated databases. The main feature of the nodes is the ability to work with the [NVIDIA Tesla V100](https://www.nvidia.com/en-gb/data-center/tesla-v100/) GPU with **NV-Link** support that allows a total bandwidth with up to 300 GB/s. Each node on the -partition ML has 6x Tesla V-100 GPUs. You can find a detailed specification of the partition in our +partition `ml` has 6x Tesla V-100 GPUs. You can find a detailed specification of the partition in our [Power9 documentation](../jobs_and_resources/hardware_overview.md). !!! note - The partition ML is based on the Power9 architecture, which means that the software built + The partition `ml` is based on the Power9 architecture, which means that the software built for x86_64 will not work on this partition. Also, users need to use the modules which are specially build for this architecture (from `modenv/ml`). ### Modules -On the partition ML load the module environment: +On the partition `ml` load the module environment: ```console marie@ml$ module load modenv/ml @@ -32,19 +32,20 @@ The following have been reloaded with a version change: 1) modenv/scs5 => moden ### Power AI -There are tools provided by IBM, that work on partition ML and are related to AI tasks. +There are tools provided by IBM, that work on partition `ml` and are related to AI tasks. For more information see our [Power AI documentation](power_ai.md). ## Partition: Alpha -Another partition for machine learning tasks is Alpha. It is mainly dedicated to -[ScaDS.AI](https://scads.ai/) topics. Each node on Alpha has 2x AMD EPYC CPUs, 8x NVIDIA A100-SXM4 -GPUs, 1 TB RAM and 3.5 TB local space (`/tmp`) on an NVMe device. You can find more details of the -partition in our [Alpha Centauri](../jobs_and_resources/alpha_centauri.md) documentation. +Another partition for machine learning tasks is `alpha`. It is mainly dedicated to +[ScaDS.AI](https://scads.ai/) topics. Each node on partition `alpha` has 2x AMD EPYC CPUs, 8x NVIDIA +A100-SXM4 GPUs, 1 TB RAM and 3.5 TB local space (`/tmp`) on an NVMe device. You can find more +details of the partition in our [Alpha Centauri](../jobs_and_resources/alpha_centauri.md) +documentation. ### Modules -On the partition alpha load the module environment: +On the partition `alpha` load the module environment: ```console marie@alpha$ module load modenv/hiera @@ -53,7 +54,7 @@ The following have been reloaded with a version change: 1) modenv/ml => modenv/ !!! note - On partition Alpha, the most recent modules are build in `hiera`. Alternative modules might be + On partition `alpha`, the most recent modules are build in `hiera`. Alternative modules might be build in `scs5`. 
## Machine Learning via Console @@ -82,7 +83,7 @@ create documents containing live code, equations, visualizations, and narrative TensorFlow or PyTorch) on ZIH systems and to run your Jupyter notebooks on HPC nodes. After accessing JupyterHub, you can start a new session and configure it. For machine learning -purposes, select either partition **Alpha** or **ML** and the resources, your application requires. +purposes, select either partition `alpha` or `ml` and the resources, your application requires. In your session you can use [Python](data_analytics_with_python.md#jupyter-notebooks), [R](data_analytics_with_r.md#r-in-jupyterhub) or [RStudio](data_analytics_with_rstudio.md) for your diff --git a/doc.zih.tu-dresden.de/docs/software/modules.md b/doc.zih.tu-dresden.de/docs/software/modules.md index 9cf35854e953b4d9c2f9380d16217dec91dfea2a..74f67821cac0c8b030b06079f86e2514030fa5d6 100644 --- a/doc.zih.tu-dresden.de/docs/software/modules.md +++ b/doc.zih.tu-dresden.de/docs/software/modules.md @@ -133,8 +133,8 @@ marie@compute$ module load modenv/ml ### modenv/ml -* data analytics software (for use on the partition ml) -* necessary to run most software on the partition ml +* data analytics software (for use on the partition `ml`) +* necessary to run most software on the partition `ml` (The instruction set [Power ISA](https://en.wikipedia.org/wiki/Power_ISA#Power_ISA_v.3.0) is different from the usual x86 instruction set. Thus the 'machine code' of other modenvs breaks). diff --git a/doc.zih.tu-dresden.de/docs/software/mpi_usage_error_detection.md b/doc.zih.tu-dresden.de/docs/software/mpi_usage_error_detection.md index a26a8c6ee9595129b32ee56db2040e7cbb11ca7a..b604bf5398681458ac416336ea7c42a0b3a25b15 100644 --- a/doc.zih.tu-dresden.de/docs/software/mpi_usage_error_detection.md +++ b/doc.zih.tu-dresden.de/docs/software/mpi_usage_error_detection.md @@ -79,8 +79,11 @@ MUST aware of this knowledge. Overhead is drastically reduced with this switch. After running your application with MUST you will have its output in the working directory of your application. The output is named `MUST_Output.html`. Open this files in a browser to analyze the -results. The HTML file is color coded: Entries in green represent notes and useful information. -Entries in yellow represent warnings, and entries in red represent errors. +results. The HTML file is color coded: + +- Entries in green represent notes and useful information +- Entries in yellow represent warnings +- Entries in red represent errors ## Further MPI Correctness Tools diff --git a/doc.zih.tu-dresden.de/docs/software/power_ai.md b/doc.zih.tu-dresden.de/docs/software/power_ai.md index 5d1c397ab00d66fe61fc41fb4cee1efaeb25801b..1488d85f9f6b4a749b5535dea59474a7c2cf36a4 100644 --- a/doc.zih.tu-dresden.de/docs/software/power_ai.md +++ b/doc.zih.tu-dresden.de/docs/software/power_ai.md @@ -5,7 +5,7 @@ the PowerAI Framework for Machine Learning. In the following the links are valid for PowerAI version 1.5.4. !!! warning - The information provided here is available from IBM and can be used on partition ml only! + The information provided here is available from IBM and can be used on partition `ml` only! 
## General Overview diff --git a/doc.zih.tu-dresden.de/docs/software/python_virtual_environments.md b/doc.zih.tu-dresden.de/docs/software/python_virtual_environments.md index bde481a3284b4476c2a69f6690623ead7246dad3..129b1d9dd053415617e77f3abad603c2d6b68809 100644 --- a/doc.zih.tu-dresden.de/docs/software/python_virtual_environments.md +++ b/doc.zih.tu-dresden.de/docs/software/python_virtual_environments.md @@ -68,7 +68,7 @@ the environment as follows: ??? example - This is an example on partition Alpha. The example creates a python virtual environment, and + This is an example on partition `alpha`. The example creates a python virtual environment, and installs the package `torchvision` with pip. ```console marie@login$ srun --partition=alpha-interactive --nodes=1 --gres=gpu:1 --time=01:00:00 --pty bash @@ -179,7 +179,7 @@ can deactivate the conda environment as follows: ??? example - This is an example on partition Alpha. The example creates a conda virtual environment, and + This is an example on partition `alpha`. The example creates a conda virtual environment, and installs the package `torchvision` with conda. ```console marie@login$ srun --partition=alpha-interactive --nodes=1 --gres=gpu:1 --time=01:00:00 --pty bash diff --git a/doc.zih.tu-dresden.de/docs/software/tensorflow.md b/doc.zih.tu-dresden.de/docs/software/tensorflow.md index 746c78a39b21845b5164217390dcc141467345fa..58b99bd1c302c0ed65619fc200602f2732f84df1 100644 --- a/doc.zih.tu-dresden.de/docs/software/tensorflow.md +++ b/doc.zih.tu-dresden.de/docs/software/tensorflow.md @@ -17,13 +17,13 @@ to find out, which TensorFlow modules are available on your partition. On ZIH systems, TensorFlow 2 is the default module version. For compatibility hints between TensorFlow 2 and TensorFlow 1, see the corresponding [section below](#compatibility-tf2-and-tf1). -We recommend using partitions **Alpha** and/or **ML** when working with machine learning workflows +We recommend using partitions `alpha` and/or `ml` when working with machine learning workflows and the TensorFlow library. You can find detailed hardware specification in our [Hardware](../jobs_and_resources/hardware_overview.md) documentation. ## TensorFlow Console -On the partition Alpha, load the module environment: +On the partition `alpha`, load the module environment: ```console marie@alpha$ module load modenv/scs5 @@ -47,7 +47,7 @@ marie@alpha$ module avail TensorFlow [...] ``` -On the partition ML load the module environment: +On the partition `ml` load the module environment: ```console marie@ml$ module load modenv/ml @@ -74,9 +74,9 @@ import TensorFlow: [...] marie@ml$ which python #check which python are you using /sw/installed/Python/3.7.2-GCCcore-8.2.0 - marie@ml$ virtualenv --system-site-packages /scratch/ws/1/python_virtual_environment/env + marie@ml$ virtualenv --system-site-packages /scratch/ws/1/marie-python_virtual_environment/env [...] - marie@ml$ source /scratch/ws/1/python_virtual_environment/env/bin/activate + marie@ml$ source /scratch/ws/1/marie-python_virtual_environment/env/bin/activate marie@ml$ python -c "import tensorflow as tf; print(tf.__version__)" [...] 2.3.1 diff --git a/doc.zih.tu-dresden.de/docs/software/visualization.md b/doc.zih.tu-dresden.de/docs/software/visualization.md index f9de3764e32ca78381c7e22a38d3bda5fe6bdd32..6c68e9a1a5891b92934ae600c57f23bbf1ebd0df 100644 --- a/doc.zih.tu-dresden.de/docs/software/visualization.md +++ b/doc.zih.tu-dresden.de/docs/software/visualization.md @@ -44,7 +44,7 @@ MPI processes to hardware. 
A convenient option is `-bind-to core`. All other op obtained by ```console -marie@login$ mpiexec -bind-to -help` +marie@login$ mpiexec -bind-to -help ``` or from @@ -135,7 +135,7 @@ virtual desktop session, then load the ParaView module as usual and start the GU ```console marie@dcv$ module load ParaView/5.7.0 -paraview +marie@dcv$ paraview ``` Since your DCV session already runs inside a job, which has been scheduled to a compute node, no diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml index 573ddba029c7f441a595dd69342280e646b4bb7f..79b56b4d83efb025a7b7e10d6f8c0df30c7815cf 100644 --- a/doc.zih.tu-dresden.de/mkdocs.yml +++ b/doc.zih.tu-dresden.de/mkdocs.yml @@ -81,6 +81,7 @@ nav: - Compilers and Flags: software/compilers.md - GPU Programming: software/gpu_programming.md - Mathematics Libraries: software/math_libraries.md + - MPI Usage Issues: jobs_and_resources/mpi_issues.md - Debugging: software/debuggers.md - Performance Engineering Tools: - Overview: software/performance_engineering_overview.md diff --git a/doc.zih.tu-dresden.de/util/check-code-style.sh b/doc.zih.tu-dresden.de/util/check-code-style.sh index 21dd8ef9ecd33304cb6f723e8a315dbc785d5849..64dce7c637e9909fd1ebd8cd07a289bfc151417b 100755 --- a/doc.zih.tu-dresden.de/util/check-code-style.sh +++ b/doc.zih.tu-dresden.de/util/check-code-style.sh @@ -29,7 +29,7 @@ files and all markdown files is done. EOF } -function style_check() { +function pattern_matches() { local any_fails any_fails=false @@ -79,9 +79,9 @@ function style_check() { fi fi if [[ "${any_fails}" == true ]]; then - return 1 + return 0 fi - return 0 + return 1 } # ----------------------------------------------------------------------------- # Functions End @@ -125,14 +125,14 @@ for file in $files; do # Variable expansion. Currently style check not possible for multiline comment pattern='.*"[\n\s\w\W]*\$[^\{|^\(]\w*[\n\s\w\W]*"' warning="Using \"\${var}\" is recommended over \"\$var\"" - if style_check "${file}" "${pattern}" "${warning}"; then + if pattern_matches "${file}" "${pattern}" "${warning}"; then any_fails=true fi # Declaration and assignment of local variables pattern='local [a-zA-Z_]*=.*' warning="Declaration and assignment of local variables should be on different lines." - if style_check "${file}" "${pattern}" "${warning}"; then + if pattern_matches "${file}" "${pattern}" "${warning}"; then any_fails=true fi @@ -142,7 +142,7 @@ for file in $files; do #echo "Checking for max line length..." pattern='^.{80}.*$' warning="Recommended maximum line length is 80 characters." - if style_check "${file}" "${pattern}" "${warning}"; then + if pattern_matches "${file}" "${pattern}" "${warning}"; then any_fails=true fi fi @@ -150,28 +150,28 @@ for file in $files; do # do, then in the same line as while, for and if pattern='^\s*(while|for|if)[\w\-\%\d\s\$=\[\]\(\)]*[^;]\s*[^do|then]\s*$' warning="It is recommended to put '; do' and '; then' on the same line as the 'while', 'for' or 'if'" - if style_check "${file}" "${pattern}" "${warning}"; then + if pattern_matches "${file}" "${pattern}" "${warning}"; then any_fails=true fi # using [[..]] over [..] 
pattern='^\s*(if|while|for)\s*\[[^\[].*$' warning="It is recommended to use '[[ … ]]' over '[ … ]', 'test' and '/usr/bin/['" - if style_check "${file}" "${pattern}" "${warning}"; then + if pattern_matches "${file}" "${pattern}" "${warning}"; then any_fails=true fi # Avoiding 'eval' pattern='^[\w\=\"\s\$\(]*eval.*' warning="It is not recommended to use eval" - if style_check "${file}" "${pattern}" "${warning}"; then + if pattern_matches "${file}" "${pattern}" "${warning}"; then any_fails=true fi # Arithmetic pattern='(\$\([^\(]|let|\$\[)\s*(expr|\w)\s*[\d\+\-\*\/\=\%\$]+' warning="It is recommended to use '(( … ))' or '\$(( … ))' rather than 'let' or '\$[ … ]' or 'expr'" - if style_check "${file}" "${pattern}" "${warning}"; then + if pattern_matches "${file}" "${pattern}" "${warning}"; then any_fails=true fi @@ -179,14 +179,14 @@ for file in $files; do # Function name pattern='^.*[A-Z]+[_a-z]*\s*\(\)\s*\{' warning="It is recommended to write function names in lower-case, with underscores to separate words" - if style_check "${file}" "${pattern}" "${warning}"; then + if pattern_matches "${file}" "${pattern}" "${warning}"; then any_fails=true fi # Constants and Environment Variable Names pattern='readonly [^A-Z]*=.*|declare [-a-zA-Z\s]*[^A-Z]*=.*' warning="Constants and anything exported to the environment should be capitalized." - if style_check "${file}" "${pattern}" "${warning}"; then + if pattern_matches "${file}" "${pattern}" "${warning}"; then any_fails=true fi done