diff --git a/.gitignore b/.gitignore index b9e65f1e880720dbee380c30294977f587de9994..ed9ec7dd5f3338e0cda169471c748dbdf5038a58 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,4 @@ *package-lock.json *package.json *node_modules +**venv/ \ No newline at end of file diff --git a/.markdownlintrc b/.markdownlintrc index 4be0c89503e39d697cdb47aec98cd0306a5bca5b..4a9cce8fa8c1ae5e3a08b42433b22a325eab0252 100644 --- a/.markdownlintrc +++ b/.markdownlintrc @@ -15,8 +15,10 @@ "single-trailing-newline": true, "blanks-around-fences": true, "blanks-around-lists": true, - "commands-show-output": true, + "commands-show-output": false, "line-length": { "line_length": 100, "code_blocks": false, "tables": false}, "no-missing-space-atx": true, - "no-multiple-space-atx": true + "no-multiple-space-atx": true, + "no-hard-tabs": true, + "no-trailing-spaces": true } diff --git a/README.md b/README.md index a4b94ed2c9208b0e143fb70702f7215728ebb754..05825be788b1d0e0d6436454e6aa0849d28d93c3 100644 --- a/README.md +++ b/README.md @@ -24,7 +24,8 @@ issues. ## Contributing -Contributions from user-side are highly welcome. Please refer to [Contribution guide]() to get started. +Contributions from user-side are highly welcome. Please refer to +[Contribution guide](doc.zih.tu-dresden.de/README.md) to get started. ## Licenses diff --git a/doc.zih.tu-dresden.de/docs/access/jupyterhub.md b/doc.zih.tu-dresden.de/docs/access/jupyterhub.md index a99dd6622472daae2cf4650cabc5dca8675fd129..6c5d86618e8e105143cfc6ad24cd954a10ce354c 100644 --- a/doc.zih.tu-dresden.de/docs/access/jupyterhub.md +++ b/doc.zih.tu-dresden.de/docs/access/jupyterhub.md @@ -60,7 +60,7 @@ the import/export feature (available through the button) to save your presets in text files. Note: the [<span style="color:blue">**alpha**</span>] -(https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/AlphaCentauri) +(https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/AlphaCentauri) partition is available only in the extended form. ## Applications @@ -107,7 +107,7 @@ create new notebooks, files, directories or terminals. ## The notebook -In JupyterHub you can create scripts in notebooks. +In JupyterHub you can create scripts in notebooks. Notebooks are programs which are split in multiple logical code blocks. In between those code blocks you can insert text blocks for documentation and each block can be executed individually. Each notebook @@ -172,7 +172,7 @@ style="border: 1px solid #888;" title="Error message: Directory not found"/>\</a> If the connection to your notebook server unexpectedly breaks you maybe -will get this error message. +will get this error message. Sometimes your notebook server might hit a Slurm or hardware limit and gets killed. Then usually the logfile of the corresponding Slurm job might contain useful information. These logfiles are located in your @@ -229,7 +229,7 @@ Here's a short list of some included software: ### Creating and using your own environment Interactive code interpreters which are used by Jupyter Notebooks are -called kernels. +called kernels. Creating and using your own kernel has the benefit that you can install your own preferred python packages and use them in your notebooks. @@ -373,7 +373,7 @@ mention in the same list. ### Loading modules You have now the option to preload modules from the LMOD module -system. +system. Select multiple modules that will be preloaded before your notebook server starts. The list of available modules depends on the module environment you want to start the session in (scs5 or ml). 
The right diff --git a/doc.zih.tu-dresden.de/docs/access/jupyterhub_for_teaching.md b/doc.zih.tu-dresden.de/docs/access/jupyterhub_for_teaching.md index ef3dacca8c243374e9efc44268b6277be5ebe2f1..784d52f40772784db9f2c44bb57af2f2d055f53b 100644 --- a/doc.zih.tu-dresden.de/docs/access/jupyterhub_for_teaching.md +++ b/doc.zih.tu-dresden.de/docs/access/jupyterhub_for_teaching.md @@ -47,9 +47,9 @@ src="<https://doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/JupyterHubForTeachin style="border: 1px solid #888;" title="URL with git-pull parameters"/>\</a> -This example would clone the repository +This example would clone the repository [https://github.com/jdwittenauer/ipython-notebooks]( - https://github.com/jdwittenauer/ipython-notebooks) + https://github.com/jdwittenauer/ipython-notebooks) and afterwards open the **Intro.ipynb** notebook in the given path. The following parameters are available: @@ -68,7 +68,7 @@ might help creating those links ## Spawner options passthrough with URL params The spawn form now offers a quick start mode by passing url -parameters. +parameters. An example: The following link would create a jupyter notebook session on the `interactive` partition with the `test` environment being loaded: @@ -137,5 +137,5 @@ src="<https://doc.zih.tu-dresden.de/hpc-wiki/pub/Compendium/JupyterHubForTeachin style="border: 1px solid #888;" title="URL with git-pull and quickstart parameters"/>\</a> -This link would redirect to +This link would redirect to `https://taurus.hrsk.tu-dresden.de/jupyter/user/{login}/notebooks/demo.ipynb` . diff --git a/doc.zih.tu-dresden.de/docs/application/access.md b/doc.zih.tu-dresden.de/docs/application/access.md index 9dd1c3af0f49f9fdf086210ccb28a38d7bf5d931..b396ad42a0946c22647ff7240ee156f20f2376ec 100644 --- a/doc.zih.tu-dresden.de/docs/application/access.md +++ b/doc.zih.tu-dresden.de/docs/application/access.md @@ -29,8 +29,8 @@ For obtaining access to the machines, the following forms have to be filled in: 1. an [online application](https://hpcprojekte.zih.tu-dresden.de/) form for the project (one form per project). The data will be stored automatically in a database. -1. Users/guests at TU Dresden without a ZIH-login have to fill in the following - [pdf] **todo** additionally. TUD-external Users fill please [this form (pdf)] **todo** +1. Users/guests at TU Dresden without a ZIH-login have to fill in the following + [pdf] **todo** additionally. TUD-external Users fill please [this form (pdf)] **todo** to get a login. Please sign and stamp it and send it by fax to +49 351 46342328, or by mail to TU Dresden, ZIH - Service Desk, 01062 Dresden, Germany. To add members with a valid ZIH-login to your diff --git a/doc.zih.tu-dresden.de/docs/archive/ram_disk_documentation.md b/doc.zih.tu-dresden.de/docs/archive/ram_disk_documentation.md index c7e50b20763d264214fe1ef11222739befe423ca..2f0a6071dc7aa1ecbb3e9b48563f70d39b773d7c 100644 --- a/doc.zih.tu-dresden.de/docs/archive/ram_disk_documentation.md +++ b/doc.zih.tu-dresden.de/docs/archive/ram_disk_documentation.md @@ -30,7 +30,7 @@ module load ramdisk Afterwards, the ramdisk can be created with the command ```Bash -make-ramdisk «size of the ramdisk in GB» +make-ramdisk «size of the ramdisk in GB» ``` The path to the ramdisk is fixed to `/ramdisks/«JOBID»`. @@ -63,7 +63,7 @@ this is typically that some process still has a file open within the ramdisk or that there is still a program using the ramdisk or having the ramdisk as its current path. 
Locating these processes, that block the destruction of the ramdisk is possible via using the command - + ```Bash lsof +d /ramdisks/«JOBID» ``` diff --git a/doc.zih.tu-dresden.de/docs/archive/system_atlas.md b/doc.zih.tu-dresden.de/docs/archive/system_atlas.md index 859dcef7ea9a311ce9de0aacc3b8df4c52ded3a0..c31a9b5dc536cbd6c76e772b317739171c83ab11 100644 --- a/doc.zih.tu-dresden.de/docs/archive/system_atlas.md +++ b/doc.zih.tu-dresden.de/docs/archive/system_atlas.md @@ -94,7 +94,7 @@ available for longer running jobs (>10 min). | `bsub -n 4 -M 1800` | All nodes | Is allowed to oversubscribe on small nodes n\[001-047\] | | `bsub -n 64 -M 1800` | `n[049-092]` | 64\*1800 will not fit onto a single small node and is therefore restricted to running on medium and large nodes | | `bsub -n 4 -M 2000` | `-n[049-092]` | Over limit for oversubscribing on small nodes `n[001-047]`, but may still go to medium nodes | -| `bsub -n 32 -M 2000` | `-n[049-092]` | Same as above | +| `bsub -n 32 -M 2000` | `-n[049-092]` | Same as above | | `bsub -n 32 -M 1880` | All nodes | Using max. 1880 MB, the job is eligible for running on any node | | `bsub -n 64 -M 2000` | `-n[085-092]` | Maximum for medium nodes is 1950 per slot - does the job **really** need **2000 MB** per process? | | `bsub -n 64 -M 1950` | `n[049-092]` | When using 1950 as maximum, it will fit to the medium nodes | diff --git a/doc.zih.tu-dresden.de/docs/data_lifecycle/overview.md b/doc.zih.tu-dresden.de/docs/data_lifecycle/overview.md index cbafd0c86b9b013a443fae6ebff28a1550a0f7e8..ac4c81a15051ef0bb58cebd6a3f93dcd68fc7067 100644 --- a/doc.zih.tu-dresden.de/docs/data_lifecycle/overview.md +++ b/doc.zih.tu-dresden.de/docs/data_lifecycle/overview.md @@ -116,10 +116,10 @@ expected? (software and version) Another important aspect is the Metadata. It is sufficient to use [Metadata](preservation_research_data.md#what-are-meta-data) for your HPC project. Metadata -standards, i.e., +standards, i.e., [Dublin core](http://dublincore.org/resources/metadata-basics/), [OME](https://www.openmicroscopy.org/), -will help to do it easier. +will help to do it easier. ### Data Hygiene diff --git a/doc.zih.tu-dresden.de/docs/data_lifecycle/workspaces.md b/doc.zih.tu-dresden.de/docs/data_lifecycle/workspaces.md index 2e3b7c10e916defe8cebf6f46fa84e177296295e..8443727ab896a13da8d76684e3524c1e21cca936 100644 --- a/doc.zih.tu-dresden.de/docs/data_lifecycle/workspaces.md +++ b/doc.zih.tu-dresden.de/docs/data_lifecycle/workspaces.md @@ -1,29 +1,37 @@ # Workspaces -## Introduction +Storage systems differ in terms of capacity, streaming bandwidth, IOPS rate, etc. Price and +efficiency don't allow to have it all in one. That is why fast parallel file systems at ZIH have +restrictions with regards to **age of files** and [quota](quotas.md). The mechanism of workspaces +enables users to better manage their HPC data. +<!--Workspaces are primarily login-related.--> +The concept of "workspaces" is common and used at a large number of HPC centers. -Storage systems come in many different ways in terms of: size, streaming bandwidth, IOPS rate. +!!! note -Price and efficiency don't allow to have it all in one. That is the reason why Taurus fast parallel -file systems have restrictions wrt. age of files. The mechanism of workspaces enables users to -better manage the data life cycle of their HPC data. Workspaces are primarily login-related. The -tool concept of "workspaces" is common in a large number of HPC centers. 
The idea is to request for -a workspace directory in a certain storage system - connected with an expiry date. After a grace -period the data is deleted automatically. The maximum lifetime of a workspace depends on the storage -system. All workspaces can be extended. + A workspace is a directory, with an associated expiration date, created on behalf of a user in a + certain storage system. -Use the fastest file systems according to recommendations. Please keep track of the data and move it -to a capacity oriented filesystem after the end of computations. +Once the workspace has reached its expiration date, it gets moved to a hidden directory and enters a +grace period. Once the grace period ends, the workspace is deleted permanently. The maximum lifetime +of a workspace depends on the storage system. All workspaces can be extended a certain amount of +times. -## Commands. Workspace Management. +!!! tip -The lifecycle of workspaces controls with commands. The basic commands will be presented below. + Use the faster file systems if you need to write temporary data in your computations, and use + the capacity oriented file systems if you only need to read data for your computations. Please + keep track of your data and move it to a capacity oriented filesystem after the end of your + computations. -To list all available filesystems for using workspaces use `ws_find -l` +## Workspace Management -Output: +### List Available File Systems -``` +To list all available file systems for using workspaces use: + +```bash +zih$ ws_find -l Available filesystems: scratch warm_archive @@ -31,24 +39,37 @@ ssd beegfs_global0 ``` -### Creation of the Workspace +### List Current Workspaces + +To list all workspaces you currently own, use: + +```bash +zih$ ws_list +id: test-workspace + workspace directory : /scratch/ws/0/marie-test-workspace + remaining time : 89 days 23 hours + creation time : Thu Jul 29 10:30:04 2021 + expiration date : Wed Oct 27 10:30:04 2021 + filesystem name : scratch + available extensions : 10 +``` + +### Allocate a Workspace To create a workspace in one of the listed filesystems use `ws_allocate`. It is necessary to specify a unique name and the duration of the workspace. -``` ws_allocate: [options] <workspace_name> -duration - -## +```bash +ws_allocate: [options] workspace_name duration Options: - -h [ --help] produce help message + -h [ --help] produce help message -V [ --version ] show version -d [ --duration ] arg (=1) duration in days -n [ --name ] arg workspace name -F [ --filesystem ] arg filesystem -r [ --reminder ] arg reminder to be sent n days before expiration - -m [ --mailaddress ] arg mailaddress to send reminder to (works only with tu-dresden.de addresses) + -m [ --mailaddress ] arg mailaddress to send reminder to (works only with tu-dresden.de mails) -x [ --extension ] extend workspace -u [ --username ] arg username -g [ --group ] group workspace @@ -56,75 +77,74 @@ Options: ``` -For example: +!!! example -``` -ws_allocate -F scratch -r 7 -m name.lastname@tu-dresden.de test-WS 90 -``` + ```bash + zih$ ws_allocate -F scratch -r 7 -m marie.testuser@tu-dresden.de test-workspace 90 + Info: creating workspace. + /scratch/ws/marie-test-workspace + remaining extensions : 10 + remaining time in days: 90 + ``` -The command creates a workspace with the name test-WS on the scratch filesystem for 90 days with an -e-mail reminder for 7 days before the expiration. 
+This will create a workspace with the name `test-workspace` on the `/scratch` file system for 90 +days with an email reminder for 7 days before the expiration. -Output: +!!! Note -``` -Info: creating workspace. -/scratch/ws/mark-SPECint -remaining extensions : 10 -remaining time in days: 90 -``` + Setting the reminder to `7` means you will get a reminder email on every day starting `7` prior + to expiration date. -<span style="color:red">Note:</span> The overview of currently used workspaces can be obtained with -the `ws_list` command. +### Extention of a Workspace -### Extention of the Workspace +The lifetime of a workspace is finite. Different file systems (storage systems) have different +maximum durations. A workspace can be extended multiple times, depending on the file system. -The lifetime of the workspace is finite. Different filesystems (storagesystems) have different -maximum durations. A workspace can be extended. +| Storage system (use with parameter -F ) | Duration, days | Extensions | Remarks | +|:------------------------------------------:|:----------:|:-------:|:---------------------------------------------------------------------------------------:| +| `ssd` | 30 | 10 | High-IOPS file system (`/lustre/ssd`) on SSDs. | +| `beegfs` | 30 | 2 | High-IOPS file system (`/lustre/ssd`) onNVMes. | +| `scratch` | 100 | 2 | Scratch file system (/scratch) with high streaming bandwidth, based on spinning disks | +| `warm_archive` | 365 | 2 | Capacity file system based on spinning disks | -The maximum duration depends on the storage system: - -| Storage system (use with parameter -F ) | Duration, days | Remarks | -|:------------------------------------------:|:----------:|:---------------------------------------------------------------------------------------:| -| ssd | 30 | High-IOPS file system (/lustre/ssd) on SSDs. | -| beegfs | 30 | High-IOPS file system (/lustre/ssd) onNVMes. | -| scratch | 100 | Scratch file system (/scratch) with high streaming bandwidth, based on spinning disks | -| warm_archive | 365 | Capacity file system based on spinning disks | - -``` -ws_extend -F scratch test-WS 100 #extend the workspace for another 100 days -``` - -Output: +To extend your workspace use the following command: ``` +zih$ ws_extend -F scratch test-workspace 100 #extend the workspace for 100 days Info: extending workspace. -/scratch/ws/masterman-test_ws +/scratch/ws/marie-test-workspace remaining extensions : 1 remaining time in days: 100 ``` -A workspace can be extended twice. With the `ws_extend` command, a new duration for the workspace is -set (not cumulative). +!!!Attention -### Deletion of the Workspace + With the `ws_extend` command, a new duration for the workspace is set. The new duration is not + added! -To delete workspace use the `ws_release` command. It is necessary to specify the name of the -workspace and the storage system in which it is located: +This means when you extend a workspace that expires in 90 days with the `ws_extend -F scratch +my-workspace 40`, it will now expire in 40 days **not** 130 days. -`ws_release -F <file system> <workspace name>` +### Deletion of a Workspace -For example: +To delete a workspace use the `ws_release` command. 
It is mandatory to specify the name of the +workspace and the file system in which it is located: -``` -ws_release -F scratch test_ws -``` +`ws_release -F <file system> <workspace name>` ### Restoring Expired Workspaces -At expiration time (or when you manually release your workspace), your workspace will be moved to a -special, hidden directory. For a month (in warm_archive: 2 months), you can still restore your data -into a valid workspace. For that, use +At expiration time your workspace will be moved to a special, hidden directory. For a month (in +warm_archive: 2 months), you can still restore your data into an existing workspace. + +!!!Warning + + When you release a workspace **by hand**, it will not receive a grace period and be + **permanently deleted** the **next day**. The advantage of this design is that you can create + and release workspaces inside jobs and not swamp the file system with data no one needs anymore + in the hidden directories (when workspaces are in the grace period). + +Use: ``` ws_restore -l -F scratch @@ -134,137 +154,133 @@ to get a list of your expired workspaces, and then restore them like that into a workspace 'new_ws': ``` -ws_restore -F scratch myuser-test_ws-1234567 new_ws +ws_restore -F scratch marie-test-workspace-1234567 new_ws ``` -<span style="color:red">Note:</span> the expired workspace has to be specified using the full name -as listed by `ws_restore -l`, including username prefix and timestamp suffix (otherwise, it cannot -be uniquely identified). The target workspace, on the other hand, must be given with just its short -name as listed by `ws_list`, without the username prefix. +The expired workspace has to be specified by its full name as listed by `ws_restore -l`, including +username prefix and timestamp suffix (otherwise, it cannot be uniquely identified). The target +workspace, on the other hand, must be given with just its short name, as listed by `ws_list`, +without the username prefix. + +Both workspaces must be on the same file system. The data from the old workspace will be moved into +a directory in the new workspace with the name of the old one. This means a fresh workspace works as +well as a workspace that already contains data. ## Linking Workspaces in HOME -It might be valuable to have links to personal workspaces within a certain directory, e.g., the user -home directory. The command `ws_register DIR` will create and manage links to all personal +It might be valuable to have links to personal workspaces within a certain directory, e.g., your +`home` directory. The command `ws_register DIR` will create and manage links to all personal workspaces within in the directory `DIR`. Calling this command will do the following: -- The directory `DIR` will be created if necessary +- The directory `DIR` will be created if necessary. - Links to all personal workspaces will be managed: - - Creates links to all available workspaces if not already present - - Removes links to released workspaces + - Create links to all available workspaces if not already present. + - Remove links to released workspaces. **Remark**: An automatic update of the workspace links can be invoked by putting the command -`ws_register DIR` in the user's personal shell configuration file (e.g., .bashrc, .zshrc). +`ws_register DIR` in your personal `shell` configuration file (e.g., `.bashrc`). 
-## How to Use Workspaces +## How to use Workspaces There are three typical options for the use of workspaces: -### Per-job storage +### Per-Job Storage A batch job needs a directory for temporary data. This can be deleted afterwards. -Here an example for the use with Gaussian: +!!! example "Use with Gaussian" -``` -#!/bin/bash -#SBATCH --partition=haswell -#SBATCH --time=96:00:00 -#SBATCH --nodes=1 -#SBATCH --ntasks=1 -#SBATCH --cpus-per-task=24 + ``` + #!/bin/bash + #SBATCH --partition=haswell + #SBATCH --time=96:00:00 + #SBATCH --nodes=1 + #SBATCH --ntasks=1 + #SBATCH --cpus-per-task=24 -module load modenv/classic -module load gaussian + module load modenv/classic + module load gaussian -COMPUTE_DIR=gaussian_$SLURM_JOB_ID -export GAUSS_SCRDIR=$(ws_allocate -F ssd $COMPUTE_DIR 7) -echo $GAUSS_SCRDIR + COMPUTE_DIR=gaussian_$SLURM_JOB_ID + export GAUSS_SCRDIR=$(ws_allocate -F ssd $COMPUTE_DIR 7) + echo $GAUSS_SCRDIR -srun g16 inputfile.gjf logfile.log + srun g16 inputfile.gjf logfile.log -test -d $GAUSS_SCRDIR && rm -rf $GAUSS_SCRDIR/* -ws_release -F ssd $COMPUTE_DIR -``` + test -d $GAUSS_SCRDIR && rm -rf $GAUSS_SCRDIR/* + ws_release -F ssd $COMPUTE_DIR + ``` Likewise, other jobs can use temporary workspaces. -### Data for a campaign - -For a series of calculations that works on the same data, you could allocate a workspace in the -scratch for e.g. 100 days: - -``` -ws_allocate -F scratch my_scratchdata 100 -``` +### Data for a Campaign -Output: +For a series of jobs or calculations that work on the same data, you should allocate a workspace +once, e.g., in `scratch` for 100 days: ``` +zih$ ws_allocate -F scratch my_scratchdata 100 Info: creating workspace. -/scratch/ws/mark-my_scratchdata +/scratch/ws/marie-my_scratchdata remaining extensions : 2 remaining time in days: 99 ``` -If you want to share it with your project group, set the correct access attributes, e.g: +You can grant your project group access rights: ``` -chmod g+wrx /scratch/ws/mark-my_scratchdata +chmod g+wrx /scratch/ws/marie-my_scratchdata ``` And verify it with: ``` -ls -la /scratch/ws/mark-my_scratchdata -``` - -Output: - -``` +zih $ ls -la /scratch/ws/marie-my_scratchdata total 8 -drwxrwx--- 2 mark hpcsupport 4096 Jul 10 09:03 . -drwxr-xr-x 5 operator adm 4096 Jul 10 09:01 .. +drwxrwx--- 2 marie hpcsupport 4096 Jul 10 09:03 . +drwxr-xr-x 5 operator adm 4096 Jul 10 09:01 .. ``` -### Mid-Term storage +### Mid-Term Storage -For data that seldomly changes but consumes a lot of space, the warm archive can be used. Note that +For data that seldom changes but consumes a lot of space, the warm archive can be used. Note that this is mounted read-only on the compute nodes, so you cannot use it as a work directory for your jobs! ``` -ws_allocate -F warm_archive my_inputdata 365 -``` - -Output: - -``` -/warm_archive/ws/mark-my_inputdata +zih$ ws_allocate -F warm_archive my_inputdata 365 +/warm_archive/ws/marie-my_inputdata remaining extensions : 2 remaining time in days: 365 ``` -<span style="color:red">Attention:</span> The warm archive is not built for billions of files. There -is a quota active of 100.000 files per group. Please archive data. To see your active quota use: +!!!Attention + + The warm archive is not built for billions of files. There + is a quota for 100.000 files per group. Please archive data. + +To see your active quota use: ``` qinfo quota /warm_archive/ws/ ``` -Note that the workspaces reside under the mountpoint /warm_archive/ws/ and not /warm_archive anymore. 
+Note that the workspaces reside under the mountpoint `/warm_archive/ws/` and not `/warm_archive` +anymore. ## F.A.Q **Q**: I am getting the error `Error: could not create workspace directory!` **A**: Please check the "locale" setting of your ssh client. Some clients (e.g. the one from MacOSX) -set values that are not valid on Taurus. You should overwrite LC_CTYPE and set it to a valid locale -value like: +set values that are not valid on our ZIH systems. You should overwrite `LC_CTYPE` and set it to a +valid locale value like `export LC_CTYPE=de_DE.UTF-8`. -``` -export LC_CTYPE=de_DE.UTF-8 -``` - -A list of valid locales can be retrieved via `locale -a`. Please use only UTF8 (or plain) settings. +A list of valid locales can be retrieved via `locale -a`. Please only use UTF8 (or plain) settings. Avoid "iso" codepages! + +**Q**: I am getting the error `Error: target workspace does not exist!` when trying to restore my +workspace. + +**A**: The workspace you want to restore into is either not on the same file system or you used the +wrong name. Use only the short name that is listed after `id:` when using `ws_list` diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md index bad6e1be5691fd2573355ae24af5db288a9f5929..5d78babb46b69d6dca5380febdfd0644118402d2 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/alpha_centauri.md @@ -198,6 +198,6 @@ There is a test example of a deep learning task that could be used for the test. work, Pytorch and Pillow package should be installed in your virtual environment (how it was shown above in the interactive job example) -- [example_pytorch_image_recognition.zip]**todo attachment** +- [example_pytorch_image_recognition.zip]**todo attachment** <!--%ATTACHURL%/example_pytorch_image_recognition.zip:--> <!--example_pytorch_image_recognition.zip--> diff --git a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hpcda.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hpcda.md index 29fb388f4bbb972b5de70abd3a652a33678510f7..d7bdec9afe83de27488e712b07e5fd5bdbcfcd17 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/hpcda.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/hpcda.md @@ -28,7 +28,7 @@ src="%ATTACHURL%/bandwidth.png" title="bandwidth.png" width="250" /> ## Access -- Application for access using this +- Application for access using this [Online Web Form](https://tu-dresden.de/zih/hochleistungsrechnen/zugang/hpc-da) ## Hardware Overview @@ -56,11 +56,11 @@ Additional hardware: - [Get started with HPC-DA](../software/get_started_with_hpcda.md) - [IBM Power AI](../software/power_ai.md) - [Work with Singularity Containers on Power9]**todo** Cloud -- [TensorFlow on HPC-DA (native)](../software/tensor_flow.md) -- [Tensorflow on Jupyter notebook](../software/tensor_flow_on_jupyter_notebook.md) +- [TensorFlow on HPC-DA (native)](../software/tensorflow.md) +- [Tensorflow on Jupyter notebook](../software/tensorflow_on_jupyter_notebook.md) - Create and run your own TensorFlow container for HPC-DA (Power9) (todo: no link at all in old compendium) - [TensorFlow on x86](../software/deep_learning.md) -- [PyTorch on HPC-DA (Power9)](../software/py_torch.md) +- [PyTorch on HPC-DA (Power9)](../software/pytorch.md) - [Python on HPC-DA (Power9)](../software/python.md) - [JupyterHub](../access/jupyterhub.md) - [R on HPC-DA (Power9)](../software/data_analytics_with_r.md) diff --git 
a/doc.zih.tu-dresden.de/docs/jobs_and_resources/overview.md b/doc.zih.tu-dresden.de/docs/jobs_and_resources/overview.md index c48a7f41b2d6a3edfb2d5142b6b65b05c28c7176..15bd9251436fcb9c2805d9e34341f4176f52c6df 100644 --- a/doc.zih.tu-dresden.de/docs/jobs_and_resources/overview.md +++ b/doc.zih.tu-dresden.de/docs/jobs_and_resources/overview.md @@ -57,7 +57,7 @@ using `sbatch [options] <job file>`. Pre-processing and post-processing of the data is a crucial part for the majority of data-dependent projects. The quality of this work influence on the computations. However, pre- and post-processing in many cases can be done completely or partially on a local pc and then transferred to the Taurus. -Please use Taurus for the computation-intensive tasks. +Please use Taurus for the computation-intensive tasks. Useful links: [Batch Systems]**todo link**, [Hardware Taurus]**todo link**, [HPC-DA]**todo link**, [Slurm]**todo link** diff --git a/doc.zih.tu-dresden.de/docs/software/containers.md b/doc.zih.tu-dresden.de/docs/software/containers.md index d0c723629cc2babca8dc8ad81d7d5db6ba2b2b9a..638b2c73bfd103d5ce8fe7cbb3cbe065874b932b 100644 --- a/doc.zih.tu-dresden.de/docs/software/containers.md +++ b/doc.zih.tu-dresden.de/docs/software/containers.md @@ -75,7 +75,7 @@ not possible for users to generate new custom containers on Taurus directly. You import an existing container from, e.g., Docker. In case you wish to create a new container, you can do so on your own local machine where you have -the necessary privileges and then simply copy your container file to Taurus and use it there. +the necessary privileges and then simply copy your container file to Taurus and use it there. This does not work on our **ml** partition, as it uses Power9 as its architecture which is different to the x86 architecture in common computers/laptops. For that you can use the @@ -159,7 +159,7 @@ $ docker push localhost:5000/alpine $ cat example.def Bootstrap: docker Registry: <a href="http://localhost:5000" rel="nofollow" target="_blank">http://localhost:5000</a> -From: alpine +From: alpine # Build singularity container $ singularity build --nohttps alpine.sif example.def diff --git a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md index 0d862c42cf588c64dc6f2c0ba440bf32750e2321..254bced046f1edff75bc0fb83ffca76f7724027e 100644 --- a/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md +++ b/doc.zih.tu-dresden.de/docs/software/data_analytics_with_r.md @@ -9,69 +9,69 @@ graphing. R possesses an extensive catalogue of statistical and graphical methods. It includes machine learning algorithms, linear regression, time series, statistical inference. -We recommend using **Haswell** and/or **Romeo** partitions to work with R. For more details -see [here](../jobs_and_resources/hardware_taurus.md). +We recommend using **Haswell** and/or **Romeo** partitions to work with R. For more details +see [here](../jobs_and_resources/hardware_taurus.md). ## R Console This is a quickstart example. The `srun` command is used to submit a real-time execution job designed for interactive use with monitoring the output. Please check -[the Slurm page](../jobs_and_resources/slurm.md) for details. +[the Slurm page](../jobs_and_resources/slurm.md) for details. 
```Bash # job submission on haswell nodes with allocating: 1 task, 1 node, 4 CPUs per task with 2541 mb per CPU(core) for 1 hour tauruslogin$ srun --partition=haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time=01:00:00 --pty bash -# Ensure that you are using the scs5 environment +# Ensure that you are using the scs5 environment module load modenv/scs5 -# Check all availble modules for R with version 3.6 +# Check all availble modules for R with version 3.6 module available R/3.6 -# Load default R module +# Load default R module module load R -# Checking the current R version +# Checking the current R version which R # Start R console R ``` -Using `srun` is recommended only for short test runs, while for larger runs batch jobs should be +Using `srun` is recommended only for short test runs, while for larger runs batch jobs should be used. The examples can be found [here](get_started_with_hpcda.md) or -[here](../jobs_and_resources/slurm.md). +[here](../jobs_and_resources/slurm.md). It is also possible to run `Rscript` command directly (after loading the module): ```Bash -# Run Rscript directly. For instance: Rscript /scratch/ws/0/marie-study_project/my_r_script.R +# Run Rscript directly. For instance: Rscript /scratch/ws/0/marie-study_project/my_r_script.R Rscript /path/to/script/your_script.R param1 param2 ``` ## R in JupyterHub -In addition to using interactive and batch jobs, it is possible to work with **R** using +In addition to using interactive and batch jobs, it is possible to work with **R** using [JupyterHub](../access/jupyterhub.md). -The production and test [environments](../access/jupyterhub.md#standard-environments) of +The production and test [environments](../access/jupyterhub.md#standard-environments) of JupyterHub contain R kernel. It can be started either in the notebook or in the console. ## RStudio -[RStudio](<https://rstudio.com/) is an integrated development environment (IDE) for R. It includes +[RStudio](<https://rstudio.com/) is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. RStudio is also available on Taurus. -The easiest option is to run RStudio in JupyterHub directly in the browser. It can be started -similarly to a new kernel from [JupyterLab](../access/jupyterhub.md#jupyterlab) launcher. +The easiest option is to run RStudio in JupyterHub directly in the browser. It can be started +similarly to a new kernel from [JupyterLab](../access/jupyterhub.md#jupyterlab) launcher.  {: align="center"} Please keep in mind that it is currently not recommended to use the interactive x11 job with the -desktop version of RStudio, as described, for example, in introduction HPC-DA slides. +desktop version of RStudio, as described, for example, in introduction HPC-DA slides. ## Install Packages in R -By default, user-installed packages are saved in the users home in a subfolder depending on -the architecture (x86 or PowerPC). Therefore the packages should be installed using interactive +By default, user-installed packages are saved in the users home in a subfolder depending on +the architecture (x86 or PowerPC). 
Therefore the packages should be installed using interactive jobs on the compute node: ```Bash @@ -80,13 +80,13 @@ srun -p haswell --ntasks=1 --nodes=1 --cpus-per-task=4 --mem-per-cpu=2541 --time module purge module load modenv/scs5 module load R -R -e 'install.packages("package_name")' #For instance: 'install.packages("ggplot2")' +R -e 'install.packages("package_name")' #For instance: 'install.packages("ggplot2")' ``` ## Deep Learning with R -The deep learning frameworks perform extremely fast when run on accelerators such as GPU. -Therefore, using nodes with built-in GPUs ([ml](../jobs_and_resources/power9.md) or +The deep learning frameworks perform extremely fast when run on accelerators such as GPU. +Therefore, using nodes with built-in GPUs ([ml](../jobs_and_resources/power9.md) or [alpha](../jobs_and_resources/alpha_centauri.md) partitions) is beneficial for the examples here. ### R Interface to TensorFlow @@ -98,14 +98,14 @@ for numerical computation using data flow graphs. ```Bash srun --partition=ml --ntasks=1 --nodes=1 --cpus-per-task=7 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash -module purge -ml modenv/ml +module purge +ml modenv/ml ml TensorFlow ml R which python mkdir python-virtual-environments # Create a folder for virtual environments -cd python-virtual-environments +cd python-virtual-environments python3 -m venv --system-site-packages R-TensorFlow #create python virtual environment source R-TensorFlow/bin/activate #activate environment module list @@ -113,16 +113,16 @@ which R ``` Please allocate the job with respect to -[hardware specification](../jobs_and_resources/hardware_taurus.md)! Note that the nodes on `ml` +[hardware specification](../jobs_and_resources/hardware_taurus.md)! Note that the nodes on `ml` partition have 4way-SMT, so for every physical core allocated, you will always get 4\*1443Mb=5772mb. In order to interact with Python-based frameworks (like TensorFlow) `reticulate` R library is used. -To configure it to point to the correct Python executable in your virtual environment, create +To configure it to point to the correct Python executable in your virtual environment, create a file named `.Rprofile` in your project directory (e.g. R-TensorFlow) with the following contents: ```R -Sys.setenv(RETICULATE_PYTHON = "/sw/installed/Anaconda3/2019.03/bin/python") #assign the output of the 'which python' from above to RETICULATE_PYTHON +Sys.setenv(RETICULATE_PYTHON = "/sw/installed/Anaconda3/2019.03/bin/python") #assign the output of the 'which python' from above to RETICULATE_PYTHON ``` Let's start R, install some libraries and evaluate the result: @@ -137,7 +137,7 @@ tf$constant("Hello Tensorflow") #In the output 'Tesla V100-SXM2-32GB' sh ``` ??? example - The example shows the use of the TensorFlow package with the R for the classification problem + The example shows the use of the TensorFlow package with the R for the classification problem related to the MNIST dataset. ```R library(tensorflow) @@ -210,7 +210,7 @@ tf$constant("Hello Tensorflow") #In the output 'Tesla V100-SXM2-32GB' sh cat('Test loss:', scores[[1]], '\n') cat('Test accuracy:', scores[[2]], '\n') ``` - + ## Parallel Computing with R Generally, the R code is serial. However, many computations in R can be made faster by the use of @@ -219,8 +219,8 @@ amounts of data and/or use of complex models are indications to use parallelizat ### General Information about the R Parallelism -There are various techniques and packages in R that allow parallelization. 
This section -concentrates on most general methods and examples. The Information here is Taurus-specific. +There are various techniques and packages in R that allow parallelization. This section +concentrates on most general methods and examples. The Information here is Taurus-specific. The [parallel](https://www.rdocumentation.org/packages/parallel/versions/3.6.2) library will be used below. @@ -230,7 +230,7 @@ conflicts with other pre-installed packages. ### Basic Lapply-Based Parallelism `lapply()` function is a part of base R. lapply is useful for performing operations on list-objects. -Roughly speaking, lapply is a vectorization of the source code and it is the first step before +Roughly speaking, lapply is a vectorization of the source code and it is the first step before explicit parallelization of the code. ### Shared-Memory Parallelism @@ -240,7 +240,7 @@ lapply. The "mc" stands for "multicore". This function distributes the `lapply` multiple CPU cores to be executed in parallel. This is a simple option for parallelization. It doesn't require much effort to rewrite the serial -code to use `mclapply` function. Check out an example below. +code to use `mclapply` function. Check out an example below. ??? example ```R @@ -261,19 +261,19 @@ code to use `mclapply` function. Check out an example below. # shared-memory version - threads <- as.integer(Sys.getenv("SLURM_CPUS_ON_NODE")) + threads <- as.integer(Sys.getenv("SLURM_CPUS_ON_NODE")) # here the name of the variable depends on the correct sbatch configuration - # unfortunately the built-in function gets the total number of physical cores without + # unfortunately the built-in function gets the total number of physical cores without # taking into account allocated cores by Slurm list_of_averages <- mclapply(X=sample_sizes, FUN=average, mc.cores=threads) # apply function "average" 100 times ``` -The disadvantages of using shared-memory parallelism approach are, that the number of parallel -tasks is limited to the number of cores on a single node. The maximum number of cores on a single +The disadvantages of using shared-memory parallelism approach are, that the number of parallel +tasks is limited to the number of cores on a single node. The maximum number of cores on a single node can be found [here](../jobs_and_resources/hardware_taurus.md). -Submitting a multicore R job to Slurm is very similar to submitting an +Submitting a multicore R job to Slurm is very similar to submitting an [OpenMP Job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks), since both are running multicore jobs on a **single** node. Below is an example: @@ -281,7 +281,7 @@ since both are running multicore jobs on a **single** node. Below is an example: #!/bin/bash #SBATCH --nodes=1 #SBATCH --tasks-per-node=1 -#SBATCH --cpus-per-task=16 +#SBATCH --cpus-per-task=16 #SBATCH --time=00:10:00 #SBATCH -o test_Rmpi.out #SBATCH -e test_Rmpi.err @@ -295,24 +295,24 @@ R CMD BATCH Rcode.R ### Distributed-Memory Parallelism -In order to go beyond the limitation of the number of cores on a single node, a cluster of workers -shall be set up. There are three options for it: MPI, PSOCK and FORK clusters. +In order to go beyond the limitation of the number of cores on a single node, a cluster of workers +shall be set up. There are three options for it: MPI, PSOCK and FORK clusters. We use `makeCluster` function from `parallel` library to create a set of copies of R processes running in parallel. 
The desired type of the cluster can be specified with a parameter `TYPE`. #### MPI Cluster -This way of the R parallelism uses the +This way of the R parallelism uses the [Rmpi](http://cran.r-project.org/web/packages/Rmpi/index.html) package and the [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) (Message Passing Interface) as a -"backend" for its parallel operations. The MPI-based job in R is very similar to submitting an -[MPI Job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks) since both are running +"backend" for its parallel operations. The MPI-based job in R is very similar to submitting an +[MPI Job](../jobs_and_resources/slurm.md#binding-and-distribution-of-tasks) since both are running multicore jobs on multiple nodes. Below is an example of running R script with the Rmpi on Taurus: ```Bash #!/bin/bash #SBATCH --partition=haswell # specify the partition -#SBATCH --ntasks=32 # this parameter determines how many processes will be spawned, please use >=8 +#SBATCH --ntasks=32 # this parameter determines how many processes will be spawned, please use >=8 #SBATCH --cpus-per-task=1 #SBATCH --time=01:00:00 #SBATCH -o test_Rmpi.out @@ -324,11 +324,11 @@ module load R mpirun -np 1 R CMD BATCH Rmpi.R # specify the absolute path to the R script, like: /scratch/ws/marie-Work/R/Rmpi.R -# submit with sbatch <script_name> +# submit with sbatch <script_name> ``` Slurm option `--ntasks` controls the total number of parallel tasks. The number of -nodes required to complete this number of tasks will be automatically selected. +nodes required to complete this number of tasks will be automatically selected. However, in some specific cases, you can specify the number of nodes and the number of necessary tasks per node explicitly: @@ -344,7 +344,7 @@ module load R mpirun -np 1 R CMD BATCH --no-save --no-restore Rmpi_c.R ``` -Use an example below, where 32 global ranks are distributed over 2 nodes with 16 cores each. +Use an example below, where 32 global ranks are distributed over 2 nodes with 16 cores each. Each MPI rank has 1 core assigned to it. ??? example @@ -390,14 +390,14 @@ Another example: # cluster setup # get number of available MPI ranks - threads = mpi.universe.size()-1 + threads = mpi.universe.size()-1 print(paste("The cluster of size", threads, "will be setup...")) # initialize MPI cluster cl <- makeCluster(threads, type="MPI", outfile="") # distribute required variables for the execution over the cluster - clusterExport(cl, list("mu","sigma")) + clusterExport(cl, list("mu","sigma")) list_of_averages <- parLapply(X=sample_sizes, fun=average, cl=cl) @@ -414,12 +414,12 @@ processes by `mpirun`), since the R takes care of it with `makeCluster` function #### PSOCK cluster The `type="PSOCK"` uses TCP sockets to transfer data between nodes. PSOCK is the default on *all* -systems. The advantage of this method is that it does not require external libraries such as MPI. +systems. The advantage of this method is that it does not require external libraries such as MPI. On the other hand, TCP sockets are relatively [slow](http://glennklockwood.blogspot.com/2013/06/whats-killing-cloud-interconnect.html). Creating a PSOCK cluster is similar to launching an MPI cluster, but instead of specifying the number of parallel workers, you have to manually specify the number of nodes according to the -hardware specification and parameters of your job. +hardware specification and parameters of your job. ??? 
example ```R @@ -441,14 +441,14 @@ hardware specification and parameters of your job. # cluster setup # get number of available nodes (should be equal to "ntasks") - mynodes = 8 + mynodes = 8 print(paste("The cluster of size", threads, "will be setup...")) # initialize cluster cl <- makeCluster(mynodes, type="PSOCK", outfile="") # distribute required variables for the execution over the cluster - clusterExport(cl, list("mu","sigma")) + clusterExport(cl, list("mu","sigma")) list_of_averages <- parLapply(X=sample_sizes, fun=average, cl=cl) @@ -458,16 +458,16 @@ hardware specification and parameters of your job. #### FORK cluster -The `type="FORK"` method behaves exactly like the `mclapply` function discussed in the previous +The `type="FORK"` method behaves exactly like the `mclapply` function discussed in the previous section. Like `mclapply`, it can only use the cores available on a single node. However this method -requires exporting the workspace data to other processes. The FORK method in a combination with -`parLapply` function might be used in situations, where different source code should run on each +requires exporting the workspace data to other processes. The FORK method in a combination with +`parLapply` function might be used in situations, where different source code should run on each parallel process. ### Other parallel options -- [foreach](https://cran.r-project.org/web/packages/foreach/index.html) library. - It is functionally equivalent to the +- [foreach](https://cran.r-project.org/web/packages/foreach/index.html) library. + It is functionally equivalent to the [lapply-based parallelism](https://www.glennklockwood.com/data-intensive/r/lapply-parallelism.html) discussed before but based on the for-loop - [future](https://cran.r-project.org/web/packages/future/index.html) @@ -475,9 +475,9 @@ parallel process. unified Future API for sequential and parallel processing of R expression via futures - [Poor-man's parallelism](https://www.glennklockwood.com/data-intensive/r/alternative-parallelism.html#6-1-poor-man-s-parallelism) - (simple data parallelism). It is the simplest, but not an elegant way to parallelize R code. + (simple data parallelism). It is the simplest, but not an elegant way to parallelize R code. It runs several copies of the same R script where's each read different sectors of the input data - [Hands-off (OpenMP)](https://www.glennklockwood.com/data-intensive/r/alternative-parallelism.html#6-2-hands-off-parallelism) method. R has [OpenMP](https://www.openmp.org/resources/) support. Thus using OpenMP is a simple - method where you don't need to know much about the parallelism options in your code. Please be + method where you don't need to know much about the parallelism options in your code. Please be careful and don't mix this technique with other methods! 
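As a supplement to the list above, here is a minimal sketch of how the `foreach`-based variant could look. It assumes that the `foreach` and `doParallel` packages are installed in your R environment; `doParallel` is picked here only as one possible backend and is not prescribed by this documentation.

```R
# Minimal foreach sketch (assumes foreach and doParallel are installed).
library(foreach)
library(doParallel)

# Use the cores granted by Slurm; fall back to 4 if the variable is unset.
threads <- as.integer(Sys.getenv("SLURM_CPUS_ON_NODE", unset = "4"))

cl <- makeCluster(threads)   # local (PSOCK) cluster on the allocated node
registerDoParallel(cl)       # register it as the foreach backend

# Same toy problem as in the examples above: averages of random samples,
# written as a parallel for-loop instead of an lapply call.
mu <- 1.0
sigma <- 1.0
sample_sizes <- rep(100000, 100)

list_of_averages <- foreach(n = sample_sizes) %dopar% {
  mean(rnorm(n, mean = mu, sd = sigma))
}

stopCluster(cl)
```

Registered this way with a local cluster, `foreach` is limited to the cores of a single node, just like `mclapply`; going beyond one node again requires one of the distributed-memory setups described above.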
diff --git a/doc.zih.tu-dresden.de/docs/software/debuggers.md b/doc.zih.tu-dresden.de/docs/software/debuggers.md index 480f78f5790e712c8726f969de72c7ab282c8ba8..165b47812b0283bc045909133c3020b96d63c4f7 100644 --- a/doc.zih.tu-dresden.de/docs/software/debuggers.md +++ b/doc.zih.tu-dresden.de/docs/software/debuggers.md @@ -159,7 +159,7 @@ noch **TODO: ddt.png title=DDT Main Window** ```Bash % module load Valgrind % valgrind ./myprog -``` +``` - for MPI parallel programs (every rank writes own valgrind logfile): diff --git a/doc.zih.tu-dresden.de/docs/software/deep_learning.md b/doc.zih.tu-dresden.de/docs/software/deep_learning.md index 32455d4b704309ac9512cc31ae5ae91492c67d5c..6439c1dc234d4cc0476c4966edf53a33d17480be 100644 --- a/doc.zih.tu-dresden.de/docs/software/deep_learning.md +++ b/doc.zih.tu-dresden.de/docs/software/deep_learning.md @@ -22,14 +22,14 @@ recommend using Ml partition [HPC-DA](../jobs_and_resources/hpcda.md). For examp module load TensorFlow ``` -There are numerous different possibilities on how to work with [TensorFlow](tensor_flow.md) on +There are numerous different possibilities on how to work with [TensorFlow](tensorflow.md) on Taurus. On this page, for all examples default, scs5 partition is used. Generally, the easiest way is using the [modules system](modules.md) and Python virtual environment (test case). However, in some cases, you may need directly installed Tensorflow stable or night releases. For this purpose use the -[EasyBuild](custom_easy_build_environment.md), [Containers](tensor_flow_container_on_hpcda.md) and see +[EasyBuild](custom_easy_build_environment.md), [Containers](tensorflow_container_on_hpcda.md) and see [the example](https://www.tensorflow.org/install/pip). For examples of using TensorFlow for ml partition -with module system see [TensorFlow page for HPC-DA](tensor_flow.md). +with module system see [TensorFlow page for HPC-DA](tensorflow.md). Note: If you are going used manually installed Tensorflow release we recommend use only stable versions. @@ -42,10 +42,10 @@ environments [ml environment and scs5 environment](modules.md#module-environment name "Keras". On this page for all examples default scs5 partition used. There are numerous different -possibilities on how to work with [TensorFlow](tensor_flow.md) and Keras +possibilities on how to work with [TensorFlow](tensorflow.md) and Keras on Taurus. Generally, the easiest way is using the [module system](modules.md) and Python virtual environment (test case) to see Tensorflow part above. -For examples of using Keras for ml partition with the module system see the +For examples of using Keras for ml partition with the module system see the [Keras page for HPC-DA](keras.md). It can either use TensorFlow as its backend. As mentioned in Keras documentation Keras capable of @@ -210,14 +210,14 @@ notebook server: jupyter notebook --generate-config ``` -Find a path of the configuration file, usually in the home under `.jupyter` directory, e.g. +Find a path of the configuration file, usually in the home under `.jupyter` directory, e.g. 
`/home//.jupyter/jupyter_notebook_config.py` Set a password (choose easy one for testing), which is needed later on to log into the server in browser session: ```Bash -jupyter notebook password Enter password: Verify password: +jupyter notebook password Enter password: Verify password: ``` you will get a message like that: diff --git a/doc.zih.tu-dresden.de/docs/software/get_started_with_hpcda.md b/doc.zih.tu-dresden.de/docs/software/get_started_with_hpcda.md index 1851cbf6d95fcb7fa69ce40171862312bd1a8891..29d39d3223dd2699abebe1514f8a2f34097ff5be 100644 --- a/doc.zih.tu-dresden.de/docs/software/get_started_with_hpcda.md +++ b/doc.zih.tu-dresden.de/docs/software/get_started_with_hpcda.md @@ -67,7 +67,7 @@ details check the [login page](../access/login.md). As soon as you have access to HPC-DA you have to manage your data. The main method of working with data on Taurus is using Workspaces. You could work with simple examples in your home directory -(where you are loading by default). However, in accordance with the +(where you are loading by default). However, in accordance with the [storage concept](../data_lifecycle/hpc_storage_concept2019.md) **please use** a [workspace](../data_lifecycle/workspaces.md) for your study and work projects. @@ -283,7 +283,7 @@ Several Tensorflow and PyTorch examples for the Jupyter notebook have been prepa simple tasks and models which will give you an understanding of how to work with ML frameworks and JupyterHub. It could be found as the [attachment] **todo** %ATTACHURL%/machine_learning_example.py in the bottom of the page. A detailed explanation and examples for TensorFlow can be found -[here](tensor_flow_on_jupyter_notebook.md). For the Pytorch - [here](py_torch.md). Usage information +[here](tensorflow_on_jupyter_notebook.md). For the Pytorch - [here](pytorch.md). Usage information about the environments for the JupyterHub could be found [here](../access/jupyterhub.md) in the chapter *Creating and using your own environment*. diff --git a/doc.zih.tu-dresden.de/docs/software/keras.md b/doc.zih.tu-dresden.de/docs/software/keras.md index 37bf6f4d5a04b249e27bf692499527ab9961ac37..34ce8e41511f662ec45922891547702a137129e5 100644 --- a/doc.zih.tu-dresden.de/docs/software/keras.md +++ b/doc.zih.tu-dresden.de/docs/software/keras.md @@ -5,7 +5,7 @@ Keras machine learning application on the new machine learning partition of Taurus. Keras is a high-level neural network API, -written in Python and capable of running on top of +written in Python and capable of running on top of [TensorFlow](https://github.com/tensorflow/tensorflow). In this page, [Keras](https://www.tensorflow.org/guide/keras) will be considered as a TensorFlow's high-level API for building and training @@ -28,7 +28,7 @@ options: - use Keras separately and use Tensorflow as an interface between Keras and GPUs. -**Prerequisites**: To work with Keras you, first of all, need +**Prerequisites**: To work with Keras you, first of all, need [access](../access/login.md) for the Taurus system, loaded Tensorflow module on ml partition, activated Python virtual environment. Basic knowledge about Python, SLURM system also required. @@ -41,19 +41,19 @@ There are three main options on how to work with Keras and Tensorflow on the HPC-DA: 1. Modules; 2. JupyterNotebook; 3. Containers. One of the main ways is using the **TODO LINK MISSING** (Modules system)(RuntimeEnvironment#Module_Environments) and Python virtual -environment. Please see the +environment. 
Please see the [Python page](./python.md) for the HPC-DA system. The information about the Jupyter notebook and the **JupyterHub** could be found [here](../access/jupyterhub.md). The use of -Containers is described [here](tensor_flow_container_on_hpcda.md). +Containers is described [here](tensorflow_container_on_hpcda.md). Keras contains numerous implementations of commonly used neural-network building blocks such as layers, [objectives](https://en.wikipedia.org/wiki/Objective_function), [activation functions](https://en.wikipedia.org/wiki/Activation_function) -[optimizers](https://en.wikipedia.org/wiki/Mathematical_optimization), +[optimizers](https://en.wikipedia.org/wiki/Mathematical_optimization), and a host of tools to make working with image and text data easier. Keras, for example, has a library for preprocessing the image data. @@ -62,7 +62,7 @@ The core data structure of Keras is a **model**, a way to organize layers. The Keras functional API is the way to go for defining as simple (sequential) as complex models, such as multi-output models, directed acyclic graphs, or models with shared -layers. +layers. ## Getting started with Keras @@ -71,14 +71,14 @@ Keras (using the module system). To get started, import [tf.keras](https://www.t as part of your TensorFlow program setup. tf.keras is TensorFlow's implementation of the [Keras API specification](https://keras.io/). This is a modified example that we -used for the [Tensorflow page](./tensor_flow.md). +used for the [Tensorflow page](./tensorflow.md). ```bash srun -p ml --gres=gpu:1 -n 1 --pty --mem-per-cpu=8000 bash module load modenv/ml #example output: The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml -mkdir python-virtual-environments +mkdir python-virtual-environments cd python-virtual-environments module load TensorFlow #example output: Module TensorFlow/1.10.0-PythonAnaconda-3.6 and 1 dependency loaded. which python @@ -164,8 +164,8 @@ Generally, for machine learning purposes ml partition is used but for some special issues, SCS5 partition can be useful. The following sbatch script will automatically execute the above Python script on ml partition. If you have a question about the sbatch script see the -article about [SLURM](./../jobs_and_resources/binding_and_distribution_of_tasks.md). -Keep in mind that you need to put the executable file (Keras_example) with +article about [SLURM](./../jobs_and_resources/binding_and_distribution_of_tasks.md). +Keep in mind that you need to put the executable file (Keras_example) with python code to the same folder as bash script or specify the path. ```bash @@ -220,13 +220,13 @@ renaming symbols, and changing default values for parameters. Thus in some cases, it makes code written for the TensorFlow 1 not compatible with TensorFlow 2. However, If you are using the high-level APIs **(tf.keras)** there may be little or no action you need to take to make -your code fully TensorFlow 2.0 [compatible](https://www.tensorflow.org/guide/migrate). +your code fully TensorFlow 2.0 [compatible](https://www.tensorflow.org/guide/migrate). 
It is still possible to run 1.X code, unmodified ([except for contrib](https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md) ), in TensorFlow 2.0: ```python -import tensorflow.compat.v1 as tf +import tensorflow.compat.v1 as tf tf.disable_v2_behavior() #instead of "import tensorflow as tf" ``` diff --git a/doc.zih.tu-dresden.de/docs/software/machine_learning.md b/doc.zih.tu-dresden.de/docs/software/machine_learning.md index beeb33c1be73cd8a079b1d45bbd9e1b5cd811b47..e80e6c346dfbeff977fdf74fc251507cc171bbcb 100644 --- a/doc.zih.tu-dresden.de/docs/software/machine_learning.md +++ b/doc.zih.tu-dresden.de/docs/software/machine_learning.md @@ -1,7 +1,7 @@ # Machine Learning On the machine learning nodes, you can use the tools from [IBM Power -AI](power_ai.md). +AI](power_ai.md). ## Interactive Session Examples diff --git a/doc.zih.tu-dresden.de/docs/software/mathematics.md b/doc.zih.tu-dresden.de/docs/software/mathematics.md index a348fbbc7f27367558c9cc83ae9fa17ad7113016..fc5d7e8942240c61790b1ff8671b9fa63f1eecab 100644 --- a/doc.zih.tu-dresden.de/docs/software/mathematics.md +++ b/doc.zih.tu-dresden.de/docs/software/mathematics.md @@ -1,7 +1,7 @@ # Mathematics Applications !!! cite - + Nature is written in mathematical language. (Galileo Galilei) diff --git a/Compendium_attachments/Vampir/vampir-framework.png b/doc.zih.tu-dresden.de/docs/software/misc/vampir-framework.png similarity index 100% rename from Compendium_attachments/Vampir/vampir-framework.png rename to doc.zih.tu-dresden.de/docs/software/misc/vampir-framework.png diff --git a/Compendium_attachments/Vampir/vampir_open_remote_dialog_auto_start.png b/doc.zih.tu-dresden.de/docs/software/misc/vampir-open-remote-dialog-auto-start.png similarity index 100% rename from Compendium_attachments/Vampir/vampir_open_remote_dialog_auto_start.png rename to doc.zih.tu-dresden.de/docs/software/misc/vampir-open-remote-dialog-auto-start.png diff --git a/Compendium_attachments/Vampir/vampir_open_remote_dialog_unstable.png b/doc.zih.tu-dresden.de/docs/software/misc/vampir-open-remote-dialog-unstable.png similarity index 100% rename from Compendium_attachments/Vampir/vampir_open_remote_dialog_unstable.png rename to doc.zih.tu-dresden.de/docs/software/misc/vampir-open-remote-dialog-unstable.png diff --git a/Compendium_attachments/Vampir/vampir_open_remote_dialog.png b/doc.zih.tu-dresden.de/docs/software/misc/vampir-open-remote-dialog.png similarity index 100% rename from Compendium_attachments/Vampir/vampir_open_remote_dialog.png rename to doc.zih.tu-dresden.de/docs/software/misc/vampir-open-remote-dialog.png diff --git a/doc.zih.tu-dresden.de/docs/software/modules-faq.md b/doc.zih.tu-dresden.de/docs/software/modules-faq.md deleted file mode 100644 index c4f7369aa93c16520c30c3e4e8ea655c723271b4..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/modules-faq.md +++ /dev/null @@ -1,3 +0,0 @@ -# F.A.Q - -faq diff --git a/doc.zih.tu-dresden.de/docs/software/papi_library.md b/doc.zih.tu-dresden.de/docs/software/papi_library.md deleted file mode 100644 index c3190a32296c72f0e16646f632430d98ceeda116..0000000000000000000000000000000000000000 --- a/doc.zih.tu-dresden.de/docs/software/papi_library.md +++ /dev/null @@ -1,40 +0,0 @@ -# PAPI Library - -Related work: - -* [PAPI documentation](http://icl.cs.utk.edu/projects/papi/wiki/Main_Page) -* [Intel 64 and IA-32 Architectures Software Developers Manual (Per thread/per core PMCs)] - 
(http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf) - -Additional sources for **Haswell** Processors: [Intel Xeon Processor E5-2600 v3 Product Family Uncore -Performance Monitoring Guide (Uncore PMCs) - Download link] -(http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v3-uncore-performance-monitoring.html) - -## Introduction - -PAPI enables users and developers to monitor how their code performs on a specific architecture. To -do so, they can register events that are counted by the hardware in performance monitoring counters -(PMCs). These counters relate to a specific hardware unit, for example a processor core. Intel -Processors used on taurus support eight PMCs per processor core. As the partitions on taurus are run -with HyperThreading Technology (HTT) enabled, each CPU can use four of these. In addition to the -**four core PMCs**, Intel processors also support **a number of uncore PMCs** for non-core -resources. (see the uncore manuals listed in top of this documentation). - -## Usage - -[Score-P](score_p.md) supports per-core PMCs. To include uncore PMCs into Score-P traces use the -software module **scorep-uncore/2016-03-29**on the Haswell partition. If you do so, disable -profiling to include the uncore measurements. This metric plugin is available at -[github](https://github.com/score-p/scorep_plugin_uncore/). - -If you want to use PAPI directly in your software, load the latest papi module, which establishes -the environment variables **PAPI_INC**, **PAPI_LIB**, and **PAPI_ROOT**. Have a look at the -[PAPI documentation](http://icl.cs.utk.edu/projects/papi/wiki/Main_Page) for details on the usage. - -## Related Software - -* [Score-P](score_p.md) -* [Linux Perf Tools](perf_tools.md) - -If you just need a short summary of your job, you might want to have a look at -[perf stat](perf_tools.md). diff --git a/doc.zih.tu-dresden.de/docs/software/perf_tools.md b/doc.zih.tu-dresden.de/docs/software/perf_tools.md index 176c772bf4bcbe3d9cf3a1eda725d0cc7f14daac..16007698726b0430f84ef20acc80cb9e1766d64d 100644 --- a/doc.zih.tu-dresden.de/docs/software/perf_tools.md +++ b/doc.zih.tu-dresden.de/docs/software/perf_tools.md @@ -4,14 +4,6 @@ entry focusses on the latter. These tools are installed on taurus, and others and provides support for sampling applications and reading performance counters. -## Installation - -On taurus load the module via - -```Bash -module load perf/r31 -``` - ## Configuration Admins can change the behaviour of the perf tools kernel part via the diff --git a/doc.zih.tu-dresden.de/docs/software/pika.md b/doc.zih.tu-dresden.de/docs/software/pika.md index 8a2b9fdb31123d64d87befdc8728ec82444eb9cf..6cfa085df5433aff220f1195f1b14d35887e0784 100644 --- a/doc.zih.tu-dresden.de/docs/software/pika.md +++ b/doc.zih.tu-dresden.de/docs/software/pika.md @@ -7,7 +7,7 @@ interface](https://selfservice.zih.tu-dresden.de/l/index.php/hpcportal/jobmonito **Hint:** To understand this small guide, it is recommended to open the [web interface](https://selfservice.zih.tu-dresden.de/l/index.php/hpcportal/jobmonitoring/z../jobs_and_resources) -in a separate window. Furthermore, at least one real HPC job should have been submitted on Taurus. +in a separate window. Furthermore, at least one real HPC job should have been submitted on Taurus. 
## Overview @@ -112,7 +112,7 @@ flags in the job script: #SBATCH --exclusive #SBATCH --comment=no_monitoring ``` - + **Note:** Disabling Pika monitoring is possible only for exclusive jobs! ## Known Issues diff --git a/doc.zih.tu-dresden.de/docs/software/python.md b/doc.zih.tu-dresden.de/docs/software/python.md index 962184d3b6fbb49f27e6c526081976d1296e500f..4f3d567e53db0034c191fe8bdbd47f180b0a19bd 100644 --- a/doc.zih.tu-dresden.de/docs/software/python.md +++ b/doc.zih.tu-dresden.de/docs/software/python.md @@ -6,18 +6,18 @@ effective. Taurus allows working with a lot of available packages and libraries which give more useful functionalities and allow use all features of Python and to avoid minuses. -**Prerequisites:** To work with PyTorch you obviously need [access](../access/login.md) for the +**Prerequisites:** To work with PyTorch you obviously need [access](../access/login.md) for the Taurus system and basic knowledge about Python, Numpy and SLURM system. -**Aim** of this page is to introduce users on how to start working with Python on the +**Aim** of this page is to introduce users on how to start working with Python on the [HPC-DA](../jobs_and_resources/power9.md) system - part of the TU Dresden HPC system. There are three main options on how to work with Keras and Tensorflow on the HPC-DA: 1. Modules; 2. [JupyterNotebook](../access/jupyterhub.md); 3.[Containers](containers.md). The main way is using the [Modules system](modules.md) and Python virtual environment. -Note: You could work with simple examples in your home directory but according to -[HPCStorageConcept2019](../data_lifecycle/hpc_storage_concept2019.md) please use **workspaces** +Note: You could work with simple examples in your home directory but according to +[HPCStorageConcept2019](../data_lifecycle/hpc_storage_concept2019.md) please use **workspaces** for your study and work projects. ## Virtual environment @@ -26,7 +26,7 @@ There are two methods of how to work with virtual environments on Taurus: 1. **Vitualenv** is a standard Python tool to create isolated Python environments. - It is the preferred interface for + It is the preferred interface for managing installations and virtual environments on Taurus and part of the Python modules. 2. **Conda** is an alternative method for managing installations and @@ -80,15 +80,15 @@ environment (with using module system) ```Bash srun -p ml -N 1 -n 1 -c 7 --mem-per-cpu=5772 --gres=gpu:1 --time=04:00:00 --pty bash # Job submission in ml nodes with 1 gpu on 1 node. -module load modenv/ml +module load modenv/ml mkdir conda-virtual-environments #create a folder cd conda-virtual-environments #go to folder which python #check which python are you using module load PythonAnaconda/3.6 #load Anaconda module which python #check which python are you using now -conda create -n conda-testenv python=3.6 #create virtual environment with the name conda-testenv and Python version 3.6 -conda activate conda-testenv #activate conda-testenv virtual environment +conda create -n conda-testenv python=3.6 #create virtual environment with the name conda-testenv and Python version 3.6 +conda activate conda-testenv #activate conda-testenv virtual environment conda deactivate #Leave the virtual environment ``` @@ -126,7 +126,7 @@ the modules and packages you need. The manual server setup you can find [here](d With Jupyterhub you can work with general data analytics tools. This is the recommended way to start working with the Taurus. 
However, some special instruments could not be available on -the Jupyterhub. +the Jupyterhub. **Keep in mind that the remote Jupyter server can offer more freedom with settings and approaches.** @@ -142,7 +142,7 @@ parallel hardware for end-users, library writers and tool developers. ### Why use MPI? MPI provides a powerful, efficient and portable way to express parallel -programs. +programs. Among many parallel computational models, message-passing has proven to be an effective one. ### Parallel Python with mpi4py @@ -162,8 +162,8 @@ optimized communication of NumPy arrays. Mpi4py is included as an extension of the SciPy-bundle modules on taurus. -Please check the SoftwareModulesList for the modules availability. The availability of the mpi4py -in the module you can check by +Please check the SoftwareModulesList for the modules availability. The availability of the mpi4py +in the module you can check by the `module whatis <name_of_the module>` command. The `module whatis` command displays a short information and included extensions of the module. @@ -223,14 +223,14 @@ install Horovod you need to create a virtual environment and load the dependencies (e.g. MPI). Installing PyTorch can take a few hours and is not recommended -**Note:** You could work with simple examples in your home directory but **please use workspaces +**Note:** You could work with simple examples in your home directory but **please use workspaces for your study and work projects** (see the Storage concept). Setup: ```Bash srun -N 1 --ntasks-per-node=6 -p ml --time=08:00:00 --pty bash #allocate a Slurm job allocation, which is a set of resources (nodes) -module load modenv/ml #Load dependencies by using modules +module load modenv/ml #Load dependencies by using modules module load OpenMPI/3.1.4-gcccuda-2018b module load Python/3.6.6-fosscuda-2018b module load cuDNN/7.1.4.18-fosscuda-2018b @@ -272,7 +272,7 @@ TensorFlow. Adapt as required and refer to the horovod documentation for details. ```Bash -HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod +HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod ``` ##### Verify that Horovod works diff --git a/doc.zih.tu-dresden.de/docs/software/py_torch.md b/doc.zih.tu-dresden.de/docs/software/pytorch.md similarity index 84% rename from doc.zih.tu-dresden.de/docs/software/py_torch.md rename to doc.zih.tu-dresden.de/docs/software/pytorch.md index 5aa3c4618720f1de9290ffeceaa6ecac1d2135f9..5d02037121bc13b80d739d0364e9b0e9c50f514e 100644 --- a/doc.zih.tu-dresden.de/docs/software/py_torch.md +++ b/doc.zih.tu-dresden.de/docs/software/pytorch.md @@ -1,27 +1,27 @@ # Pytorch for Data Analytics -[PyTorch](https://pytorch.org/) is an open-source machine learning framework. -It is an optimized tensor library for deep learning using GPUs and CPUs. -PyTorch is a machine learning tool developed by Facebooks AI division +[PyTorch](https://pytorch.org/) is an open-source machine learning framework. +It is an optimized tensor library for deep learning using GPUs and CPUs. +PyTorch is a machine learning tool developed by Facebooks AI division to process large-scale object detection, segmentation, classification, etc. -PyTorch provides a core datastructure, the tensor, a multi-dimensional array that shares many -similarities with Numpy arrays. 
+PyTorch provides a core datastructure, the tensor, a multi-dimensional array that shares many +similarities with Numpy arrays. PyTorch also consumed Caffe2 for its backend and added support of ONNX. -**Prerequisites:** To work with PyTorch you obviously need [access](../access/login.md) for the +**Prerequisites:** To work with PyTorch you obviously need [access](../access/login.md) for the Taurus system and basic knowledge about Python, Numpy and SLURM system. -**Aim** of this page is to introduce users on how to start working with PyTorch on the +**Aim** of this page is to introduce users on how to start working with PyTorch on the [HPC-DA](../jobs_and_resources/power9.md) system - part of the TU Dresden HPC system. -There are numerous different possibilities of how to work with PyTorch on Taurus. +There are numerous different possibilities of how to work with PyTorch on Taurus. Here we will consider two main methods. 1\. The first option is using Jupyter notebook with HPC-DA nodes. The easiest way is by using [Jupyterhub](../access/jupyterhub.md). It is a recommended way for beginners in PyTorch and users who are just starting their work with Taurus. -2\. The second way is using the Modules system and Python or conda virtual environment. +2\. The second way is using the Modules system and Python or conda virtual environment. See [the Python page](python.md) for the HPC-DA system. Note: The information on working with the PyTorch using Containers could be found @@ -36,14 +36,14 @@ For working with PyTorch and python packages using virtual environments (kernels Creating and using your kernel (environment) has the benefit that you can install your preferred python packages and use them in your notebooks. -A virtual environment is a cooperatively isolated runtime environment that allows Python users and -applications to install and upgrade Python distribution packages without interfering with -the behaviour of other Python applications running on the same system. So the -[Virtual environment](https://docs.python.org/3/glossary.html#term-virtual-environment) -is a self-contained directory tree that contains a Python installation for a particular version of -Python, plus several additional packages. At its core, the main purpose of -Python virtual environments is to create an isolated environment for Python projects. -Python virtual environment is the main method to work with Deep Learning software as PyTorch on the +A virtual environment is a cooperatively isolated runtime environment that allows Python users and +applications to install and upgrade Python distribution packages without interfering with +the behaviour of other Python applications running on the same system. So the +[Virtual environment](https://docs.python.org/3/glossary.html#term-virtual-environment) +is a self-contained directory tree that contains a Python installation for a particular version of +Python, plus several additional packages. At its core, the main purpose of +Python virtual environments is to create an isolated environment for Python projects. +Python virtual environment is the main method to work with Deep Learning software as PyTorch on the HPC-DA system. ### Conda and Virtualenv @@ -51,23 +51,23 @@ HPC-DA system. There are two methods of how to work with virtual environments on Taurus: -1.**Vitualenv (venv)** is a standard Python tool to create isolated Python environments. -In general, It is the preferred interface for managing installations and virtual environments -on Taurus. 
-It has been integrated into the standard library under the -[venv module](https://docs.python.org/3/library/venv.html). +1.**Vitualenv (venv)** is a standard Python tool to create isolated Python environments. +In general, It is the preferred interface for managing installations and virtual environments +on Taurus. +It has been integrated into the standard library under the +[venv module](https://docs.python.org/3/library/venv.html). We recommend using **venv** to work with Python packages and Tensorflow on Taurus. -2\. The **conda** command is the interface for managing installations and virtual environments on -Taurus. -The **conda** is a tool for managing and deploying applications, environments and packages. -Conda is an open-source package management system and environment management system from Anaconda. +2\. The **conda** command is the interface for managing installations and virtual environments on +Taurus. +The **conda** is a tool for managing and deploying applications, environments and packages. +Conda is an open-source package management system and environment management system from Anaconda. The conda manager is included in all versions of Anaconda and Miniconda. -**Important note!** Due to the use of Anaconda to create PyTorch modules for the ml partition, -it is recommended to use the conda environment for working with the PyTorch to avoid conflicts over +**Important note!** Due to the use of Anaconda to create PyTorch modules for the ml partition, +it is recommended to use the conda environment for working with the PyTorch to avoid conflicts over the sources of your packages (pip or conda). -**Note:** Keep in mind that you **cannot** use conda for working with the virtual environments +**Note:** Keep in mind that you **cannot** use conda for working with the virtual environments previously created with Vitualenv tool and vice versa This example shows how to install and start working with PyTorch (with @@ -86,43 +86,43 @@ using module system) import torch torch.version.__version__ #Example output: 1.1.0 -Keep in mind that using **srun** directly on the shell will lead to blocking and launch an -interactive job. Apart from short test runs, -it is **recommended to launch your jobs into the background by using batch jobs**. -For that, you can conveniently put the parameters directly into the job file +Keep in mind that using **srun** directly on the shell will lead to blocking and launch an +interactive job. Apart from short test runs, +it is **recommended to launch your jobs into the background by using batch jobs**. +For that, you can conveniently put the parameters directly into the job file which you can submit using *sbatch [options] <job_file_name>*. ## Running the model and examples Below are examples of Jupyter notebooks with PyTorch models which you can run on ml nodes of HPC-DA. -There are two ways how to work with the Jupyter notebook on HPC-DA system. You can use a -[remote Jupyter server](deep_learning.md) or [JupyterHub](../access/jupyterhub.md). +There are two ways how to work with the Jupyter notebook on HPC-DA system. You can use a +[remote Jupyter server](deep_learning.md) or [JupyterHub](../access/jupyterhub.md). Jupyterhub is a simple and recommended way to use PyTorch. -We are using Jupyterhub for our examples. +We are using Jupyterhub for our examples. -Prepared examples of PyTorch models give you an understanding of how to work with -Jupyterhub and PyTorch models. 
It can be useful and instructive to start +Prepared examples of PyTorch models give you an understanding of how to work with +Jupyterhub and PyTorch models. It can be useful and instructive to start your acquaintance with PyTorch and HPC-DA system from these simple examples. JupyterHub is available here: [taurus.hrsk.tu-dresden.de/jupyter](https://taurus.hrsk.tu-dresden.de/jupyter) After login, you can start a new session by clicking on the button. -**Note:** Detailed guide (with pictures and instructions) how to run the Jupyterhub +**Note:** Detailed guide (with pictures and instructions) how to run the Jupyterhub you could find on [the page](../access/jupyterhub.md). -Please choose the "IBM Power (ppc64le)". You need to download an example -(prepared as jupyter notebook file) that already contains all you need for the start of the work. -Please put the file into your previously created virtual environment in your working directory or +Please choose the "IBM Power (ppc64le)". You need to download an example +(prepared as jupyter notebook file) that already contains all you need for the start of the work. +Please put the file into your previously created virtual environment in your working directory or use the kernel for your notebook [see Jupyterhub page](../access/jupyterhub.md). -Note: You could work with simple examples in your home directory but according to -[HPCStorageConcept2019](../data_lifecycle/hpc_storage_concept2019.md) please use **workspaces** -for your study and work projects. +Note: You could work with simple examples in your home directory but according to +[HPCStorageConcept2019](../data_lifecycle/hpc_storage_concept2019.md) please use **workspaces** +for your study and work projects. For this reason, you have to use advanced options of Jupyterhub and put "/" in "Workspace scope" field. -To download the first example (from the list below) into your previously created +To download the first example (from the list below) into your previously created virtual environment you could use the following command: ws_list #list of your workspaces @@ -136,11 +136,11 @@ placed in your virtual environment. See the [jupyterhub](../access/jupyterhub.md Examples: -1\. Simple MNIST model. The MNIST database is a large database of handwritten digits that is -commonly used for training various image processing systems. PyTorch allows us to import and -download the MNIST dataset directly from the Torchvision - package consists of datasets, +1\. Simple MNIST model. The MNIST database is a large database of handwritten digits that is +commonly used for training various image processing systems. PyTorch allows us to import and +download the MNIST dataset directly from the Torchvision - package consists of datasets, model architectures and transformations. -The model contains a neural network with sequential architecture and typical modules +The model contains a neural network with sequential architecture and typical modules for this kind of models. Recommended parameters for running this model are 1 GPU and 7 cores (28 thread) (example_MNIST_Pytorch.zip) @@ -149,10 +149,10 @@ for this kind of models. Recommended parameters for running this model are 1 GPU Open [JupyterHub](../access/jupyterhub.md) and follow instructions above. -In Jupyterhub documents are organized with tabs and a very versatile split-screen feature. -On the left side of the screen, you can open your file. Use 'File-Open from Path' -to go to your workspace (e.g. `scratch/ws/<username-name_of_your_ws>`). 
-You could run each cell separately step by step and analyze the result of each step. +In Jupyterhub documents are organized with tabs and a very versatile split-screen feature. +On the left side of the screen, you can open your file. Use 'File-Open from Path' +to go to your workspace (e.g. `scratch/ws/<username-name_of_your_ws>`). +You could run each cell separately step by step and analyze the result of each step. Default command for running one cell Shift+Enter'. Also, you could run all cells with the command ' run all cells' in the 'Run' Tab. @@ -160,22 +160,22 @@ run all cells' in the 'Run' Tab. ### Pre-trained networks -The PyTorch gives you an opportunity to use pre-trained models and networks for your purposes -(as a TensorFlow for instance) especially for computer vision and image recognition. As you know +The PyTorch gives you an opportunity to use pre-trained models and networks for your purposes +(as a TensorFlow for instance) especially for computer vision and image recognition. As you know computer vision is one of the fields that have been most impacted by the advent of deep learning. -We will use a network trained on ImageNet, taken from the TorchVision project, -which contains a few of the best performing neural network architectures for computer vision, -such as AlexNet, one of the early breakthrough networks for image recognition, and ResNet, -which won the ImageNet classification, detection, and localization competitions, in 2015. -[TorchVision](https://pytorch.org/vision/stable/index.html) also has easy access to datasets like -ImageNet and other utilities for getting up -to speed with computer vision applications in PyTorch. +We will use a network trained on ImageNet, taken from the TorchVision project, +which contains a few of the best performing neural network architectures for computer vision, +such as AlexNet, one of the early breakthrough networks for image recognition, and ResNet, +which won the ImageNet classification, detection, and localization competitions, in 2015. +[TorchVision](https://pytorch.org/vision/stable/index.html) also has easy access to datasets like +ImageNet and other utilities for getting up +to speed with computer vision applications in PyTorch. The pre-defined models can be found in torchvision.models. -**Important note**: For the ml nodes only the Torchvision 0.2.2. is available (10.11.20). -The last updates from IBM include only Torchvision 0.4.1 CPU version. -Be careful some features from modern versions of Torchvision are not available in the 0.2.2 +**Important note**: For the ml nodes only the Torchvision 0.2.2. is available (10.11.20). +The last updates from IBM include only Torchvision 0.4.1 CPU version. +Be careful some features from modern versions of Torchvision are not available in the 0.2.2 (e.g. some kinds of `transforms`). Always check the version with: `print(torchvision.__version__)` Examples: @@ -185,8 +185,8 @@ Recommended parameters for running this model are 1 GPU and 7 cores (28 thread). (example_Pytorch_image_recognition.zip) -Remember that for using [JupyterHub service](../access/jupyterhub.md) -for PyTorch you need to create and activate +Remember that for using [JupyterHub service](../access/jupyterhub.md) +for PyTorch you need to create and activate a virtual environment (kernel) with loaded essential modules (see "envtest" environment form the virtual environment example. @@ -195,10 +195,10 @@ Run the example in the same way as the previous example (MNIST model). 
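For orientation, loading one of the pre-defined TorchVision networks can be sketched as follows (an illustration only, not part of the downloadable example: the choice of ResNet-18 is arbitrary, the weights are fetched on first use, and the input is a random dummy tensor instead of a real image):

```python
import torch
import torchvision
from torchvision import models

print(torchvision.__version__)       # check the installed Torchvision version first

# ResNet-18 pre-trained on ImageNet; the old `pretrained=True` keyword
# matches the Torchvision 0.2.2 release mentioned above.
model = models.resnet18(pretrained=True)
model.eval()

dummy = torch.randn(1, 3, 224, 224)  # batch x channels x height x width
with torch.no_grad():
    scores = model(dummy)
print(scores.shape)                  # torch.Size([1, 1000]), one score per ImageNet class
```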
### Using Multiple GPUs with PyTorch Effective use of GPUs is essential, and it implies using parallelism in -your code and model. Data Parallelism and model parallelism are effective instruments +your code and model. Data Parallelism and model parallelism are effective instruments to improve the performance of your code in case of GPU using. -The data parallelism is a widely-used technique. It replicates the same model to all GPUs, +The data parallelism is a widely-used technique. It replicates the same model to all GPUs, where each GPU consumes a different partition of the input data. You could see this method [here](https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html). The example below shows how to solve that problem by using model @@ -209,10 +209,10 @@ devices. As the only part of a model operates on any individual device, a set of collectively serve a larger model. It is recommended to use [DistributedDataParallel] -(https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html), -instead of this class, to do multi-GPU training, even if there is only a single node. -See: Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel. -Check the [page](https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead) and +(https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html), +instead of this class, to do multi-GPU training, even if there is only a single node. +See: Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel. +Check the [page](https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead) and [Distributed Data Parallel](https://pytorch.org/docs/stable/notes/ddp.html#ddp). Examples: @@ -225,8 +225,8 @@ model are **2 GPU** and 14 cores (56 thread). (example_PyTorch_parallel.zip) -Remember that for using [JupyterHub service](../access/jupyterhub.md) -for PyTorch you need to create and activate +Remember that for using [JupyterHub service](../access/jupyterhub.md) +for PyTorch you need to create and activate a virtual environment (kernel) with loaded essential modules. Run the example in the same way as the previous examples. @@ -234,10 +234,10 @@ Run the example in the same way as the previous examples. #### Distributed data-parallel [DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel) -(DDP) implements data parallelism at the module level which can run across multiple machines. +(DDP) implements data parallelism at the module level which can run across multiple machines. Applications using DDP should spawn multiple processes and create a single DDP instance per process. -DDP uses collective communications in the [torch.distributed] -(https://pytorch.org/tutorials/intermediate/dist_tuto.html) +DDP uses collective communications in the [torch.distributed] +(https://pytorch.org/tutorials/intermediate/dist_tuto.html) package to synchronize gradients and buffers. The tutorial could be found [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). 
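To make the data-parallel idea above concrete, a minimal single-process sketch with `nn.DataParallel` could look like this (illustration only: the tiny linear model and the random batch are placeholders, and for serious multi-GPU training the `DistributedDataParallel` approach recommended above is preferable):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                    # placeholder model

if torch.cuda.device_count() > 1:
    # Replicates the model to all visible GPUs; each replica processes a
    # slice of the input batch and the outputs are gathered again afterwards.
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

batch = torch.randn(64, 128, device=device)   # dummy input batch
output = model(batch)
print(output.shape)                           # torch.Size([64, 10])
```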
diff --git a/doc.zih.tu-dresden.de/docs/software/score_p.md b/doc.zih.tu-dresden.de/docs/software/scorep.md similarity index 76% rename from doc.zih.tu-dresden.de/docs/software/score_p.md rename to doc.zih.tu-dresden.de/docs/software/scorep.md index 6b570729e8594e6504c386ee128f4477e6472bb8..504224d8c69c81c0db890d24e3cd64413895ff8d 100644 --- a/doc.zih.tu-dresden.de/docs/software/score_p.md +++ b/doc.zih.tu-dresden.de/docs/software/scorep.md @@ -2,34 +2,34 @@ The Score-P measurement infrastructure is a highly scalable and easy-to-use tool suite for profiling, event tracing, and online analysis of HPC applications. Currently, it works with the -analysis tools Vampir, Scalasca, Periscope, and Tau. Score-P supports lots of features e.g. +analysis tools Vampir, Scalasca, and Tau. Score-P supports lots of features e.g. -* MPI, SHMEM, OpenMP, pthreads, and hybrid programs +* MPI, SHMEM, OpenMP, Pthreads, and hybrid programs * Manual source code instrumentation -* Monitoring of CUDA applications +* Monitoring of CUDA, OpenCL, and OpenACC applications * Recording hardware counter by using PAPI library -* Function filtering and grouping +* Function filtering and grouping Only the basic usage is shown in this Wiki. For a comprehensive Score-P user manual refer to the -[Score-P website](https://www.vi-hps.org/projects/score-p/). +[Score-P website](https://score-p.org/). Before using Score-P, set up the correct environment with -```Bash -module load scorep +```console +$ module load Score-P ``` To make measurements with Score-P, the user's application program needs to be instrumented, i.e., at specific important points ("events") Score-P measurement calls have to be activated. By default, Score-P handles this automatically. In order to enable instrumentation of function calls, MPI as -well as OpenMP events, the user only needs to prepend the Score-P wrapper to the usual compiler and -linker commands. The following sections show some examples depending on the parallelization type of +well as OpenMP events, the user only needs to prepend the Score-P wrapper to the usual compile and +link commands. The following sections show some examples depending on the parallelization type of the program. ## Serial Programs * original: `ifort a.f90 b.f90 -o myprog` -* with instrumentation: `scorep ifort a.f90 b.f90 -o myprog` +* with instrumentation: `scorep ifort a.f90 b.f90 -o myprog` This will instrument user functions (if supported by the compiler) and link the Score-P library. @@ -41,13 +41,13 @@ automatically: * original: `mpicc hello.c -o hello` * with instrumentation: `scorep mpicc hello.c -o hello` -MPI implementations without own compilers (as on the Altix) require the user to link the MPI library +MPI implementations without own compilers require the user to link the MPI library manually. Even in this case, Score-P will detect MPI parallelization automatically: * original: `icc hello.c -o hello -lmpi` * with instrumentation: `scorep icc hello.c -o hello -lmpi` -However, if Score-P falis to detect MPI parallelization automatically you can manually select MPI +However, if Score-P fails to detect MPI parallelization automatically you can manually select MPI instrumentation: * original: `icc hello.c -o hello -lmpi` @@ -61,17 +61,17 @@ option `--nocompiler` to disable automatic instrumentation of user functions. 
When Score-P detects OpenMP flags on the command line, OPARI2 is invoked for automatic source code instrumentation of OpenMP events: -* original: `ifort -openmp pi.f -o pi` -* with instrumentation: `scorep ifort -openmp pi.f -o pi` +* original: `ifort -openmp pi.f -o pi` +* with instrumentation: `scorep ifort -openmp pi.f -o pi` ## Hybrid MPI/OpenMP Parallel Programs With a combination of the above mentioned approaches, hybrid applications can be instrumented: -* original: `mpif90 -openmp hybrid.F90 -o hybrid` +* original: `mpif90 -openmp hybrid.F90 -o hybrid` * with instrumentation: `scorep mpif90 -openmp hybrid.F90 -o hybrid` -## Score-P instrumenter option overview +## Score-P Instrumenter Option Overview | Type of instrumentation | Instrumenter switch | Default value | Runtime measurement control | | --- | --- | --- | --- | @@ -90,16 +90,15 @@ After the application run, you will find an experiment directory in your current which contains all recorded data. In general, you can record a profile and/or a event trace. Whether a profile and/or a trace is recorded, is specified by the environment variables `SCOREP_ENABLE_PROFILING` and `SCOREP_ENABLE_TRACING` (see -[documentation](https://perftools.pages.jsc.fz-juelich.de/cicd/scorep/tags/scorep-7.0/html/measurement.html)). +[documentation](https://perftools.pages.jsc.fz-juelich.de/cicd/scorep/tags/latest/html/measurement.html)). If the value of this variables is zero or false, profiling/tracing is disabled. Otherwise Score-P will record a profile and/or trace. By default, profiling is enabled and tracing is disabled. For -more information please see -[the list of Score-P measurement configuration variables] -(https://perftools.pages.jsc.fz-juelich.de/cicd/scorep/tags/scorep-7.0/html/scorepmeasurementconfig.html) +more information please see the list of Score-P measurement +[configuration variables](https://perftools.pages.jsc.fz-juelich.de/cicd/scorep/tags/latest/html/scorepmeasurementconfig.html). You may start with a profiling run, because of its lower space requirements. According to profiling results, you may configure the trace buffer limits, filtering or selective recording for recording traces. Score-P allows to configure several parameters via environment variables. After the -measurement run you can find a scorep.cfg file in your experiment directory which contains the +measurement run you can find a `scorep.cfg` file in your experiment directory which contains the configuration of the measurement run. If you had not set configuration values explicitly, the file -will contain the default values. +will contain the default values. diff --git a/doc.zih.tu-dresden.de/docs/software/software_development_overview.md b/doc.zih.tu-dresden.de/docs/software/software_development_overview.md index dc55ba6c6f49666e25a0380f49d4c58dc45e9e0a..c87d4c93b5fe27ba82ca261aad359df48a7e741c 100644 --- a/doc.zih.tu-dresden.de/docs/software/software_development_overview.md +++ b/doc.zih.tu-dresden.de/docs/software/software_development_overview.md @@ -40,7 +40,7 @@ Subsections: - [Debugging Tools](Debugging Tools.md) - [Debuggers](debuggers.md) (GDB, Allinea DDT, Totalview) - [Tools to detect MPI usage errors](mpi_usage_error_detection.md) (MUST) -- PerformanceTools.md: [Score-P](score_p.md), [Vampir](vampir.md), [Papi Library](papi_library.md) +- PerformanceTools.md: [Score-P](scorep.md), [Vampir](vampir.md) - [Libraries](libraries.md) Intel Tools Seminar \[Oct. 
2013\] diff --git a/doc.zih.tu-dresden.de/docs/software/tensor_flow.md b/doc.zih.tu-dresden.de/docs/software/tensorflow.md similarity index 96% rename from doc.zih.tu-dresden.de/docs/software/tensor_flow.md rename to doc.zih.tu-dresden.de/docs/software/tensorflow.md index e912c9260a4416b7211b2e25a3fc744099cdbb6d..aa0806ee9f16c0af5d27dda498661b60433cb7fd 100644 --- a/doc.zih.tu-dresden.de/docs/software/tensor_flow.md +++ b/doc.zih.tu-dresden.de/docs/software/tensorflow.md @@ -7,14 +7,14 @@ machine learning applications on the [HPC-DA](../jobs_and_resources/hpcda.md) sy \<span style="font-size: 1em;">On the machine learning nodes (machine learning partition), you can use the tools from [IBM PowerAI](power_ai.md) or the other -modules. PowerAI is an enterprise software distribution that combines popular open-source +modules. PowerAI is an enterprise software distribution that combines popular open-source deep learning frameworks, efficient AI development tools (Tensorflow, Caffe, etc). For this page and examples was used [PowerAI version 1.5.4](https://www.ibm.com/support/knowledgecenter/en/SS5SF7_1.5.4/navigation/pai_software_pkgs.html) [TensorFlow](https://www.tensorflow.org/guide/) is a free end-to-end open-source software library for dataflow and differentiable programming across many tasks. It is a symbolic math library, used primarily for machine -learning applications. It has a comprehensive, flexible ecosystem of tools, libraries and +learning applications. It has a comprehensive, flexible ecosystem of tools, libraries and community resources. It is available on taurus along with other common machine learning packages like Pillow, SciPY, Numpy. @@ -26,32 +26,32 @@ TensorFlow on the \<a href="HPCDA" target="\_self">HPC-DA\</a> system - part of the TU Dresden HPC system. There are three main options on how to work with Tensorflow on the -HPC-DA: **1.** **Modules,** **2.** **JupyterNotebook, 3. Containers**. The best option is -to use [module system](../software/runtime_environment.md#Module_Environments) and +HPC-DA: **1.** **Modules,** **2.** **JupyterNotebook, 3. Containers**. The best option is +to use [module system](../software/runtime_environment.md#Module_Environments) and Python virtual environment. Please see the next chapters and the [Python page](python.md) for the HPC-DA system. The information about the Jupyter notebook and the **JupyterHub** could be found [here](../access/jupyterhub.md). The use of -Containers is described [here](tensor_flow_container_on_hpcda.md). +Containers is described [here](tensorflow_container_on_hpcda.md). -On Taurus, there exist different module environments, each containing a set -of software modules. The default is *modenv/scs5* which is already loaded, -however for the HPC-DA system using the "ml" partition you need to use *modenv/ml*. +On Taurus, there exist different module environments, each containing a set +of software modules. The default is *modenv/scs5* which is already loaded, +however for the HPC-DA system using the "ml" partition you need to use *modenv/ml*. To find out which partition are you using use: `ml list`. -You can change the module environment with the command: +You can change the module environment with the command: module load modenv/ml -The machine learning partition is based on the PowerPC Architecture (ppc64le) -(Power9 processors), which means that the software built for x86_64 will not -work on this partition, so you most likely can't use your already locally -installed packages on Taurus. 
Also, users need to use the modules which are -specially made for the ml partition (from modenv/ml) and not for the rest -of Taurus (e.g. from modenv/scs5). +The machine learning partition is based on the PowerPC Architecture (ppc64le) +(Power9 processors), which means that the software built for x86_64 will not +work on this partition, so you most likely can't use your already locally +installed packages on Taurus. Also, users need to use the modules which are +specially made for the ml partition (from modenv/ml) and not for the rest +of Taurus (e.g. from modenv/scs5). -Each node on the ml partition has 6x Tesla V-100 GPUs, with 176 parallel threads -on 44 cores per node (Simultaneous multithreading (SMT) enabled) and 256GB RAM. +Each node on the ml partition has 6x Tesla V-100 GPUs, with 176 parallel threads +on 44 cores per node (Simultaneous multithreading (SMT) enabled) and 256GB RAM. The specification could be found [here](../jobs_and_resources/power9.md). %RED%Note:<span class="twiki-macro ENDCOLOR"></span> Users should not diff --git a/doc.zih.tu-dresden.de/docs/software/tensor_flow_container_on_hpcda.md b/doc.zih.tu-dresden.de/docs/software/tensorflow_container_on_hpcda.md similarity index 100% rename from doc.zih.tu-dresden.de/docs/software/tensor_flow_container_on_hpcda.md rename to doc.zih.tu-dresden.de/docs/software/tensorflow_container_on_hpcda.md diff --git a/doc.zih.tu-dresden.de/docs/software/tensor_flow_on_jupyter_notebook.md b/doc.zih.tu-dresden.de/docs/software/tensorflow_on_jupyter_notebook.md similarity index 97% rename from doc.zih.tu-dresden.de/docs/software/tensor_flow_on_jupyter_notebook.md rename to doc.zih.tu-dresden.de/docs/software/tensorflow_on_jupyter_notebook.md index 42f7e699358beeddd70fc839c574eba8be49dcce..a8dee14a25a9e7c82ed1977ad3e573defd4e791a 100644 --- a/doc.zih.tu-dresden.de/docs/software/tensor_flow_on_jupyter_notebook.md +++ b/doc.zih.tu-dresden.de/docs/software/tensorflow_on_jupyter_notebook.md @@ -1,4 +1,4 @@ -# Tensorflow on Jupyter Notebook +# Tensorflow on Jupyter Notebook %RED%Note: This page is under construction<span class="twiki-macro ENDCOLOR"></span> @@ -64,7 +64,7 @@ is a self-contained directory tree that contains a Python installation for a particular version of Python, plus several additional packages. At its core, the main purpose of Python virtual environments is to create an isolated environment for Python projects. Python virtual environment is -the main method to work with Deep Learning software as TensorFlow on the +the main method to work with Deep Learning software as TensorFlow on the [HPCDA](../jobs_and_resources/hpcda.md) system. ### Conda and Virtualenv @@ -73,12 +73,12 @@ There are two methods of how to work with virtual environments on Taurus. **Vitualenv (venv)** is a standard Python tool to create isolated Python environments. We recommend using venv to work with Tensorflow and Pytorch on Taurus. It -has been integrated into the standard library under +has been integrated into the standard library under the [venv](https://docs.python.org/3/library/venv.html). However, if you have reasons (previously created environments etc) you could easily use conda. The conda is the second way to use a virtual -environment on the Taurus. -[Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) +environment on the Taurus. 
+[Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) is an open-source package management system and environment management system from the Anaconda. @@ -113,7 +113,7 @@ Now you can check the working capacity of the current environment. ### Install Ipykernel Ipykernel is an interactive Python shell and a Jupyter kernel to work -with Python code in Jupyter notebooks. The IPython kernel is the Python +with Python code in Jupyter notebooks. The IPython kernel is the Python execution backend for Jupyter. The Jupyter Notebook automatically ensures that the IPython kernel is available. @@ -150,7 +150,7 @@ will recommend using Jupyterhub for our examples. JupyterHub is available [here](https://taurus.hrsk.tu-dresden.de/jupyter) -Please check updates and details [JupyterHub](../access/jupyterhub.md). However, +Please check updates and details [JupyterHub](../access/jupyterhub.md). However, the general pipeline can be briefly explained as follows. After logging, you can start a new session and configure it. There are @@ -167,9 +167,9 @@ contains all you need for the start of the work. Please put the file into your previously created virtual environment in your working directory or use the kernel for your notebook. -Note: You could work with simple examples in your home directory but according to -[new storage concept](../data_lifecycle/hpc_storage_concept2019.md) please use -[workspaces](../data_lifecycle/workspaces.md) for your study and work projects**. +Note: You could work with simple examples in your home directory but according to +[new storage concept](../data_lifecycle/hpc_storage_concept2019.md) please use +[workspaces](../data_lifecycle/workspaces.md) for your study and work projects**. For this reason, you have to use advanced options and put "/" in "Workspace scope" field. To download the first example (from the list below) into your previously @@ -183,7 +183,7 @@ created virtual environment you could use the following command: unzip Example_TensorFlow_Automobileset.zip ``` -Also, you could use kernels for all notebooks, not only for them which placed +Also, you could use kernels for all notebooks, not only for them which placed in your virtual environment. See the [jupyterhub](../access/jupyterhub.md) page. ### Examples: @@ -249,4 +249,4 @@ your study. - [Example_TensorFlow_Meteo_airport.zip]**todo**(Example_TensorFlow_Meteo_airport.zip): Example_TensorFlow_Meteo_airport.zip - [Example_TensorFlow_3D_road_network.zip]**todo**(Example_TensorFlow_3D_road_network.zip): - Example_TensorFlow_3D_road_network.zip + Example_TensorFlow_3D_road_network.zip diff --git a/doc.zih.tu-dresden.de/docs/software/vampir.md b/doc.zih.tu-dresden.de/docs/software/vampir.md index 464f29bb14ce5c775938bdbd0023767d72765287..b00f7a8d2682ff4319b07c2fc204a0440c63f90a 100644 --- a/doc.zih.tu-dresden.de/docs/software/vampir.md +++ b/doc.zih.tu-dresden.de/docs/software/vampir.md @@ -8,34 +8,36 @@ graphical displays, including state diagrams, statistics, and timelines, can be to obtain a better understanding of their parallel program inner working and to subsequently optimize it. Vampir allows to focus on appropriate levels of detail, which allows the detection and explanation of various performance bottlenecks such as load imbalances and communication -deficiencies. [Follow this link for further -information](http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/projekte/vampir). +deficiencies. 
Follow this +[link](http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/projekte/vampir) +for further information. -A growing number of performance monitoring environments like [VampirTrace](../archive/vampir_trace.md), -Score-P, TAU or KOJAK can produce trace files that are readable by Vampir. The tool supports trace -files in Open Trace Format (OTF, OTF2) that is developed by ZIH and its partners and is especially -designed for massively parallel programs. +[Score-P](scorep.md) is the primary code instrumentation and run-time measurement framework for +Vampir and supports various instrumentation methods, including instrumentation at source level and +at compile/link time. The tool supports trace files in Open Trace Format (OTF, OTF2) that is +developed by ZIH and its partners and is especially designed for massively parallel programs. -\<img alt="" src="%ATTACHURLPATH%/vampir-framework.png" title="Vampir Framework" /> + +{: align="center"} ## Starting Vampir Prior to using Vampir you need to set up the correct environment on one the HPC systems with: -```Bash -module load vampir +```console +$ module load Vampir ``` For members of TU Dresden the Vampir tool is also available as [download](http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/projekte/vampir/vampir_download_tu) for installation on your personal computer. -Make sure, that compressed display forwarding (e.g. `ssh -XC taurus.hrsk.tu-dresden.de`) is +Make sure, that compressed display forwarding (e.g., `ssh -YC taurus.hrsk.tu-dresden.de`) is enabled. Start the GUI by typing -```Bash -vampir +```console +$ vampir ``` on your command line or by double-clicking the Vampir icon on your personal computer. @@ -47,15 +49,15 @@ for a tutorial on using the tool. ## Using VampirServer VampirServer provides additional scalable analysis capabilities to the Vampir GUI mentioned above. -To use VampirServer on the HPC resources of TU Dresden proceed as follows: start the Vampir GUI as +To use VampirServer on the ZIH Systems proceed as follows: start the Vampir GUI as described above and use the *Open Remote* dialog with the parameters indicated in the following -figure to start and connect a VampirServer instance running on taurus.hrsk.tu-dresden.de. Make sure +figure to start and connect a VampirServer already instance running on the HPC system. Make sure to fill in your personal ZIH login name. -\<img alt="" src="%ATTACHURLPATH%/vampir_open_remote_dialog.png" -title="Vampir Open Remote Dialog" /> + +{: align="center"} -Click on the Connect button and wait until the connection is established. Enter your password when +Click on the *Connect* button and wait until the connection is established. Enter your password when requested. Depending on the available resources on the target system, this setup can take some time. Please be patient and take a look at available resources beforehand. @@ -65,20 +67,20 @@ Please be patient and take a look at available resources beforehand. VampirServer is a parallel MPI program, which can also be started manually by typing: -```Bash -vampirserver start +```console +$ vampirserver start ``` Above automatically allocates its resources via the respective batch system. Use -```Bash -vampirserver start mpi +```console +$ vampirserver start mpi ``` or -```Bash -vampirserver start srun +```console +$ vampirserver start srun ``` if you want to start vampirserver without batch allocation or inside an interactive allocation. 
The @@ -90,30 +92,28 @@ After scheduling this job the server prints out the port number it is serving on Connecting to the most recently started server can be achieved by entering `auto-detect` as *Setup name* in the *Open Remote* dialog of Vampir. -\<img alt="" -src="%ATTACHURLPATH%/vampir_open_remote_dialog_auto_start.png" -title="Vampir Open Remote Dialog" /> + +{: align="center"} Please make sure you stop VampirServer after finishing your work with the front-end or with -```Bash -vampirserver stop +```console +$ vampirserver stop ``` Type -```Bash -vampirserver help +```console +$ vampirserver help ``` -for further information. The [user manual of -VampirServer](http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/projekte/vampir/dateien/VampirServer-User-Manual.pdf) -can be found at *installation directory* /doc/vampirserver-manual.pdf. +for further information. The [user manual](http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/projekte/vampir/dateien/VampirServer-User-Manual.pdf) +of VampirServer can be found at `doc/vampirserver-manual.pdf` in the installation directory. Type -```Bash -which vampirserver +```console +$ which vampirserver ``` to find the revision dependent *installation directory*. @@ -123,18 +123,17 @@ to find the revision dependent *installation directory*. VampirServer listens to a given socket port. It is possible to forward this port (SSH tunnel) to a remote machine. This procedure is not recommended and not needed at ZIH. However, the following example shows -the tunneling to a VampirServer on a compute node at Taurus. The same -procedure works on Venus. +the tunneling to a VampirServer on a compute node. -Start VampirServer on Taurus and wait for its scheduling: +Start VampirServer on the ZIH system and wait for its scheduling: -```Bash -vampirserver start +```console +$ vampirserver start ``` and wait for scheduling -```Bash +```console Launching VampirServer... Submitting slurm 30 minutes job (this might take a while)... salloc: Granted job allocation 2753510 @@ -146,8 +145,8 @@ VampirServer listens on: taurusi1253:30055 Open a second console on your local desktop and create an ssh tunnel to the compute node with: -```Bash -ssh -L 30000:taurusi1253:30055 taurus.hrsk.tu-dresden.de +```console +$ ssh -L 30000:taurusi1253:30055 taurus.hrsk.tu-dresden.de ``` Now, the port 30000 on your desktop is connected to the VampirServer port 30055 at the compute node @@ -157,13 +156,12 @@ taurusi1253 of Taurus. Finally, start your local Vampir client and establish a r Remark: Please substitute the ports given in this example with appropriate numbers and available ports. -### Nightly builds (unstable) +### Nightly Builds (unstable) Expert users who subscribed to the development program can test new, unstable tool features. The corresponding Vampir and VampirServer software releases are provided as nightly builds. Unstable versions of VampirServer are also installed on the HPC systems. The most recent version can be launched/connected by entering `unstable` as *Setup name* in the *Open Remote* dialog of Vampir. 
-\<img alt="" -src="%ATTACHURLPATH%/vampir_open_remote_dialog_unstable.png" -title="Connecting to unstable VampirServer" /> + +{: align="center"} diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml index 2ff444a02a573d86be0846cecf270401e29a3f22..cf0a7483bb9df2007143cd9f0d5da85a5df9e540 100644 --- a/doc.zih.tu-dresden.de/mkdocs.yml +++ b/doc.zih.tu-dresden.de/mkdocs.yml @@ -47,13 +47,13 @@ nav: - Data Analytics with R: software/data_analytics_with_r.md - Data Analytics with Python: software/python.md - TensorFlow: - - TensorFlow Overview: software/tensor_flow.md - - TensorFlow in Container: software/tensor_flow_container_on_hpcda.md - - TensorFlow in JupyterHub: software/tensor_flow_on_jupyter_notebook.md + - TensorFlow Overview: software/tensorflow.md + - TensorFlow in Container: software/tensorflow_container_on_hpcda.md + - TensorFlow in JupyterHub: software/tensorflow_on_jupyter_notebook.md - Keras: software/keras.md - Dask: software/dask.md - Power AI: software/power_ai.md - - PyTorch: software/py_torch.md + - PyTorch: software/pytorch.md - Apache Spark, Apache Flink, Apache Hadoop: software/big_data_frameworks.md - SCS5 Migration Hints: software/scs5_software.md - Virtual Machines: software/virtual_machines.md @@ -66,7 +66,7 @@ nav: - Debuggers: software/debuggers.md - Libraries: software/libraries.md - MPI Error Detection: software/mpi_usage_error_detection.md - - Score-P: software/score_p.md + - Score-P: software/scorep.md - PAPI Library: software/papi_library.md - Perf Tools: software/perf_tools.md - PIKA: software/pika.md diff --git a/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css b/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css index e0b4935efdbbeff8f726c6b4f7f99124fd943f1c..48d68e16ad923de101f5891aa48cacec677e128d 100644 --- a/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css +++ b/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css @@ -66,9 +66,15 @@ strong { .md-grid { max-width: 1600px; } + +.md-typeset code { + word-break: normal; +} + + /* header */ .zih-logo img{ - display: inline-none; + display: none; } @media screen and (min-width: 76.25rem) { .md-header, @@ -146,3 +152,11 @@ strong { .md-footer { background-color: var(--md-primary-fg-color); } + +.highlight .go { + user-select: none; +} + +.highlight .gp { + user-select: none; +}