From 267fb4bc5de24c45e9f28564ecef83aca0c91a72 Mon Sep 17 00:00:00 2001
From: Martin Schroschk <martin.schroschk@tu-dresden.de>
Date: Mon, 27 Sep 2021 13:13:53 +0200
Subject: [PATCH] Review virtual machines pages

---
 .../docs/software/containers.md               |   9 +-
 .../docs/software/virtual_machines.md         |  87 ++++++++-------
 ...{vm_tools.md => virtual_machines_tools.md} | 104 +++++++++---------
 doc.zih.tu-dresden.de/mkdocs.yml              |   4 +-
 doc.zih.tu-dresden.de/wordlist.aspell         |   3 +
 5 files changed, 108 insertions(+), 99 deletions(-)
 rename doc.zih.tu-dresden.de/docs/software/{vm_tools.md => virtual_machines_tools.md} (50%)

diff --git a/doc.zih.tu-dresden.de/docs/software/containers.md b/doc.zih.tu-dresden.de/docs/software/containers.md
index 46157daf5..a6ac51769 100644
--- a/doc.zih.tu-dresden.de/docs/software/containers.md
+++ b/doc.zih.tu-dresden.de/docs/software/containers.md
@@ -12,11 +12,12 @@ Singularity. Information about the use of Singularity on ZIH systems can be foun
 In some cases using Singularity requires a Linux machine with root privileges (e.g. using the
 partition `ml`), the same architecture and a compatible kernel. For many reasons, users on ZIH
 systems cannot be granted root permissions. A solution is a Virtual Machine (VM) on the partition
-`ml` which allows users to gain root permissions in an isolated environment. The corresponding
-documentation can be found [here](virtual_machines.md).
+`ml` which allows users to gain root permissions in an isolated environment. There are two main
+options for working with Virtual Machines on ZIH systems:
 
-<!--1. [VM tools](vm_tools.md): Automative algorithms for using virtual machines;-->
-<!--1. [Manual method](virtual_machines.md): It required more operations but gives you more flexibility and reliability.-->
+1. [VM tools](virtual_machines_tools.md): Automated scripts for using virtual machines;
+1. [Manual method](virtual_machines.md): It requires more operations but gives you more flexibility
+   and reliability.
 
 ## Singularity
 
diff --git a/doc.zih.tu-dresden.de/docs/software/virtual_machines.md b/doc.zih.tu-dresden.de/docs/software/virtual_machines.md
index 5104c7b35..ad2ba3984 100644
--- a/doc.zih.tu-dresden.de/docs/software/virtual_machines.md
+++ b/doc.zih.tu-dresden.de/docs/software/virtual_machines.md
@@ -1,88 +1,89 @@
-# Virtual machine on Taurus
+# Virtual Machines
 
-The following instructions are primarily aimed at users who want to build their
-[Singularity](containers.md) containers on Taurus.
+The following instructions are primarily aimed at users who want to build their own
+[Singularity](containers.md) containers on ZIH systems.
 
 The Singularity container setup requires a Linux machine with root privileges, the same architecture
 and a compatible kernel. If some of these requirements can not be fulfilled, then there is
-also the option of using the provided virtual machines on Taurus.
+also the option of using the provided virtual machines (VM) on ZIH systems.
 
-Currently, starting VMs is only possible on ML and HPDLF nodes. The VMs on the ML nodes are used to
-build singularity containers for the Power9 architecture and the HPDLF nodes to build singularity
-containers for the x86 architecture.
+Currently, starting VMs is only possible on partitions `ml` and `hpdlf`. The VMs on the partition
+`ml` are used to build Singularity containers for the Power9 architecture, and the VMs on the
+partition `hpdlf` to build Singularity containers for the x86 architecture.
 
-## Create a virtual machine
+## Create a Virtual Machine
 
-The `--cloud=kvm` SLURM parameter specifies that a virtual machine should be started.
+The `--cloud=kvm` Slurm parameter specifies that a virtual machine should be started.
 
-### On Power9 architecture
+### On Power9 Architecture
 
-```Bash
-rotscher@tauruslogin3:~> srun -p ml -N 1 -c 4 --hint=nomultithread --cloud=kvm --pty /bin/bash
+```console
+marie@login$ srun -p ml -N 1 -c 4 --hint=nomultithread --cloud=kvm --pty /bin/bash
 srun: job 6969616 queued and waiting for resources
 srun: job 6969616 has been allocated resources
 bash-4.2$
 ```
 
-### On x86 architecture
+### On x86 Architecture
 
-```Bash
-rotscher@tauruslogin3:~> srun -p hpdlf -N 1 -c 4 --hint=nomultithread --cloud=kvm --pty /bin/bash
+```console
+marie@login$ srun -p hpdlf -N 1 -c 4 --hint=nomultithread --cloud=kvm --pty /bin/bash
 srun: job 2969732 queued and waiting for resources
 srun: job 2969732 has been allocated resources
 bash-4.2$
 ```
 
-## Access virtual machine
+## Access a Virtual Machine
 
-Since the security issue on Taurus, we restricted the file system permissions. Now you have to wait
-until the file /tmp/${SLURM_JOB_USER}\_${SLURM_JOB_ID}/activate is created, then you can try to ssh
-into the virtual machine (VM), but it could be that the VM needs some more seconds to boot and start
-the SSH daemon. So you may need to try the `ssh` command multiple times till it succeeds.
+Due to a security issue on ZIH systems, we restricted the filesystem permissions. Now you have to
+wait until the file `/tmp/${SLURM_JOB_USER}_${SLURM_JOB_ID}/activate` is created, then you can try
+to connect to the virtual machine via `ssh`, but it could be that the virtual machine needs some
+more seconds to boot and start the SSH daemon. So you may need to try the `ssh` command multiple
+times till it succeeds.
 
-```Bash
-bash-4.2$ cat /tmp/rotscher_2759627/activate
+```console
+bash-4.2$ cat /tmp/marie_2759627/activate
 #!/bin/bash
 
 if ! grep -q -- "Key for the VM on the ml partition" "/home/rotscher/.ssh/authorized_keys" >& /dev/null; then
-  cat "/tmp/rotscher_2759627/kvm.pub" >> "/home/rotscher/.ssh/authorized_keys"
+  cat "/tmp/marie_2759627/kvm.pub" >> "/home/marie/.ssh/authorized_keys"
 else
-  sed -i "s|.*Key for the VM on the ml partition.*|ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC3siZfQ6vQ6PtXPG0RPZwtJXYYFY73TwGYgM6mhKoWHvg+ZzclbBWVU0OoU42B3Ddofld7TFE8sqkHM6M+9jh8u+pYH4rPZte0irw5/27yM73M93q1FyQLQ8Rbi2hurYl5gihCEqomda7NQVQUjdUNVc6fDAvF72giaoOxNYfvqAkw8lFyStpqTHSpcOIL7pm6f76Jx+DJg98sXAXkuf9QK8MurezYVj1qFMho570tY+83ukA04qQSMEY5QeZ+MJDhF0gh8NXjX/6+YQrdh8TklPgOCmcIOI8lwnPTUUieK109ndLsUFB5H0vKL27dA2LZ3ZK+XRCENdUbpdoG2Czz Key for the VM on the ml partition|" "/home/rotscher/.ssh/authorized_keys"
+  sed -i "s|.*Key for the VM on the ml partition.*|ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC3siZfQ6vQ6PtXPG0RPZwtJXYYFY73TwGYgM6mhKoWHvg+ZzclbBWVU0OoU42B3Ddofld7TFE8sqkHM6M+9jh8u+pYH4rPZte0irw5/27yM73M93q1FyQLQ8Rbi2hurYl5gihCEqomda7NQVQUjdUNVc6fDAvF72giaoOxNYfvqAkw8lFyStpqTHSpcOIL7pm6f76Jx+DJg98sXAXkuf9QK8MurezYVj1qFMho570tY+83ukA04qQSMEY5QeZ+MJDhF0gh8NXjX/6+YQrdh8TklPgOCmcIOI8lwnPTUUieK109ndLsUFB5H0vKL27dA2LZ3ZK+XRCENdUbpdoG2Czz Key for the VM on the ml partition|" "/home/marie/.ssh/authorized_keys"
 fi
 
-ssh -i /tmp/rotscher_2759627/kvm root@192.168.0.6
-bash-4.2$ source /tmp/rotscher_2759627/activate
+ssh -i /tmp/marie_2759627/kvm root@192.168.0.6
+bash-4.2$ source /tmp/marie_2759627/activate
 Last login: Fri Jul 24 13:53:48 2020 from gateway
-[root@rotscher_2759627 ~]#
+[root@marie_2759627 ~]#
 ```
 
-## Example usage
+## Example Usage
 
 ## Automation
 
-We provide [Tools](vm_tools.md) to automate these steps. You may just type `startInVM --arch=power9`
-on a tauruslogin node and you will be inside the VM with everything mounted.
+We provide [tools](virtual_machines_tools.md) to automate these steps. You may just type `startInVM
+--arch=power9` on a login node and you will be inside the VM with everything mounted.
 
 ## Known Issues
 
 ### Temporary Memory
 
-The available space inside the VM can be queried with `df -h`. Currently the whole VM has 8G and
-with the installed operating system, 6.6GB of available space.
+The available space inside the VM can be queried with `df -h`. Currently the whole VM has 8 GB and
+with the installed operating system, 6.6 GB of available space.
 
 Sometimes the Singularity build might fail because of a disk out-of-memory error. In this case it
 might be enough to delete leftover temporary files from Singularity:
 
-```Bash
+```console
 rm -rf /tmp/sbuild-*
 ```
 
 If that does not help, e.g., because one build alone needs more than the available disk memory, then
 it will be necessary to use the tmp folder on scratch. In order to ensure that the files in the
-temporary folder will be owned by root, it is necessary to set up an image inside /scratch/tmp
-instead of using it directly. E.g., to create a 25GB of temporary memory image:
+temporary folder will be owned by root, it is necessary to set up an image inside `/scratch/tmp`
+instead of using it directly. E.g., to create a 25 GB of temporary memory image:
 
-```Bash
+```console
 tmpDir="$( mktemp -d --tmpdir=/host_data/tmp )" && tmpImg="$tmpDir/singularity-build-temp-dir"
 export LANG_BACKUP=$LANG
 unset LANG
@@ -90,13 +91,17 @@ truncate -s 25G "$tmpImg.ext4" && echo yes | mkfs.ext4 "$tmpImg.ext4"
 export LANG=$LANG_BACKUP
 ```
 
-The image can now be mounted and with the **SINGULARITY_TMPDIR** environment variable can be
-specified as the temporary directory for Singularity builds. Unfortunately, because of an open
-Singularity [bug](https://github.com/sylabs/singularity/issues/32) it is should be avoided to mount
-the image using **/dev/loop0**.
+The image can now be mounted and specified as the temporary directory for Singularity builds via
+the `SINGULARITY_TMPDIR` environment variable. Unfortunately, because of an open Singularity
+[bug](https://github.com/sylabs/singularity/issues/32), mounting the image using `/dev/loop0`
+should be avoided.
 
-```Bash
-mkdir -p "$tmpImg" && i=1 && while test -e "/dev/loop$i"; do (( ++i )); done && mknod -m 0660 "/dev/loop$i" b 7 "$i"<br />mount -o loop="/dev/loop$i" "$tmpImg"{.ext4,}<br /><br />export SINGULARITY_TMPDIR="$tmpImg"<br /><br />singularity build my-container.{sif,def}
+```console
+mkdir -p "$tmpImg" && i=1 && while test -e "/dev/loop$i"; do (( ++i )); done && mknod -m 0660 "/dev/loop$i" b 7 "$i"
+mount -o loop="/dev/loop$i" "$tmpImg"{.ext4,}
+
+export SINGULARITY_TMPDIR="$tmpImg"
+singularity build my-container.{sif,def}
 ```
 
 The architecture of the base image is automatically chosen when you use an image from DockerHub.
@@ -106,4 +111,4 @@ Bootstraps **shub** and **library** should be avoided.
 ### Transport Endpoint is not Connected
 
 This happens when the SSHFS mount gets unmounted because it is not very stable. It is sufficient to
-run `\~/mount_host_data.sh` again or just the sshfs command inside that script.
+run `~/mount_host_data.sh` again or just the SSHFS command inside that script.
diff --git a/doc.zih.tu-dresden.de/docs/software/vm_tools.md b/doc.zih.tu-dresden.de/docs/software/virtual_machines_tools.md
similarity index 50%
rename from doc.zih.tu-dresden.de/docs/software/vm_tools.md
rename to doc.zih.tu-dresden.de/docs/software/virtual_machines_tools.md
index 5a4d58a7e..0b03ddf92 100644
--- a/doc.zih.tu-dresden.de/docs/software/vm_tools.md
+++ b/doc.zih.tu-dresden.de/docs/software/virtual_machines_tools.md
@@ -1,71 +1,70 @@
-# Singularity on Power9 / ml partition
+# Singularity on Partition `ml`
 
-Building Singularity containers from a recipe on Taurus is normally not possible due to the
-requirement of root (administrator) rights, see [Containers](containers.md). For obvious reasons
-users on Taurus cannot be granted root permissions.
+!!! note "Root privileges"
 
-The solution is to build your container on your local Linux machine by executing something like
+    Building Singularity containers from a recipe on ZIH systems is normally not possible due to the
+    requirement of root (administrator) rights, see [Containers](containers.md). For obvious reasons
+    users cannot be granted root permissions.
 
-```Bash
-sudo singularity build myContainer.sif myDefinition.def
-```
-
-Then you can copy the resulting myContainer.sif to Taurus and execute it there.
+The solution is to build your container on your local Linux workstation using Singularity and copy
+it to ZIH systems for execution.
 
-This does **not** work on the ml partition as it uses the Power9 architecture which your laptop
-likely doesn't.
+**This does not work on the partition `ml`** as it uses the Power9 architecture which your
+workstation likely doesn't.
 
-For this we provide a Virtual Machine (VM) on the ml partition which allows users to gain root
+For this we provide a Virtual Machine (VM) on the partition `ml` which allows users to gain root
 permissions in an isolated environment. The workflow to use this manually is described at
-[another page](virtual_machines.md) but is quite cumbersome.
+[this page](virtual_machines.md) but is quite cumbersome.
 
 To make this easier two programs are provided: `buildSingularityImage` and `startInVM` which do what
 they say. The latter is for more advanced use cases so you should be fine using
-*buildSingularityImage*, see the following section.
+`buildSingularityImage`, see the following section.
 
-**IMPORTANT:** You need to have your default SSH key without a password for the scripts to work as
-entering a password through the scripts is not supported.
+!!! note "SSH key without password"
+
+    You need to have your default SSH key without a password for the scripts to work as
+    entering a password through the scripts is not supported.
 
 **The recommended workflow** is to create and test a definition file locally. You usually start from
 a base Docker container. Those typically exist for different architectures but with a common name
-(e.g. 'ubuntu:18.04'). Singularity automatically uses the correct Docker container for your current
+(e.g. `ubuntu:18.04`). Singularity automatically uses the correct Docker container for your current
 architecture when building. So in most cases you can write your definition file, build it and test
-it locally, then move it to Taurus and build it on Power9 without any further changes. However,
-sometimes Docker containers for different architectures have different suffixes, in which case you'd
-need to change that when moving to Taurus.
+it locally, then move it to ZIH systems and build it on Power9 (partition `ml`) without any further
+changes. However, sometimes Docker containers for different architectures have different suffixes,
+in which case you'd need to change that when moving to ZIH systems.
 
-## Building a Singularity container in a job
+## Build a Singularity Container in a Job
 
-To build a singularity container on Taurus simply run:
+To build a Singularity container on ZIH systems simply run:
 
-```Bash
-buildSingularityImage --arch=power9 myContainer.sif myDefinition.def
+```console
+marie@login$ buildSingularityImage --arch=power9 myContainer.sif myDefinition.def
 ```
 
-This command will submit a batch job and immediately return. Note that while "power9" is currently
+This command will submit a batch job and immediately return. Note that while Power9 is currently
 the only supported architecture, the parameter is still required. If you want it to block while the
-image is built and see live output, use the parameter `--interactive`:
+image is built and see live output, add the option `--interactive`:
 
-```Bash
-buildSingularityImage --arch=power9 --interactive myContainer.sif myDefinition.def
+```console
+marie@login$ buildSingularityImage --arch=power9 --interactive myContainer.sif myDefinition.def
 ```
 
 There are more options available which can be shown by running `buildSingularityImage --help`. All
 have reasonable defaults.The most important ones are:
 
-- `--time <time>`: Set a higher job time if the default time is not
-  enough to build your image and your job is cancelled before completing. The format is the same
-  as for SLURM.
-- `--tmp-size=<size in GB>`: Set a size used for the temporary
+* `--time <time>`: Set a higher job time if the default time is not
+  enough to build your image and your job is canceled before completing. The format is the same as
+  for Slurm.
+* `--tmp-size=<size in GB>`: Set a size used for the temporary
   location of the Singularity container. Basically the size of the extracted container.
-- `--output=<file>`: Path to a file used for (log) output generated
+* `--output=<file>`: Path to a file used for (log) output generated
   while building your container.
-- Various singularity options are passed through. E.g.
+* Various Singularity options are passed through. E.g.
   `--notest, --force, --update`. See, e.g., `singularity --help` for details.
 
 For **advanced users** it is also possible to manually request a job with a VM (`srun -p ml
 --cloud=kvm ...`) and then use this script to build a Singularity container from within the job. In
-this case the `--arch` and other SLURM related parameters are not required. The advantage of using
+this case the `--arch` and other Slurm related parameters are not required. The advantage of using
 this script is that it automates the waiting for the VM and mounting of host directories into it
 (can also be done with `startInVM`) and creates a temporary directory usable with Singularity inside
 the VM controlled by the `--tmp-size` parameter.
 
@@ -78,31 +77,31 @@ As the build starts in a VM you may not have access to all your files. It is us
 to refer to local files from inside a definition file anyway as this reduces reproducibility.
 However common directories are available by default. For others, care must be taken. In short:
 
-- `/home/$USER`, `/scratch/$USER` are available and should be used `/scratch/\<group>` also works for
-- all groups the users is in `/projects/\<group>` similar, but is read-only! So don't use this to
-  store your generated container directly, but rather move it here afterwards
-- /tmp is the VM local temporary directory. All files put here will be lost!
+* `/home/$USER`, `/scratch/$USER` are available and should be used
+* `/scratch/<group>` also works for all groups the user is in
+* `/projects/<group>` is also available, but read-only! Copy your finished container here afterwards
+* `/tmp` is the VM local temporary directory. All files put here will be lost!
 
 If the current directory is inside (or equal to) one of the above (except `/tmp`), then relative
 paths for container and definition work as the script changes to the VM equivalent of the current
 directory. Otherwise you need to use absolute paths. Using `~` in place of `$HOME` does work too.
 
-Under the hood, the filesystem of Taurus is mounted via SSHFS at `/host_data`, so if you need any
+Under the hood, the filesystem of ZIH systems is mounted via SSHFS at `/host_data`, so if you need any
 other files they can be found there.
 
-There is also a new SSH key named "kvm" which is created by the scripts and authorized inside the VM
-to allow for password-less access to SSHFS. This is stored at `~/.ssh/kvm` and regenerated if it
+There is also a new SSH key named `kvm` which is created by the scripts and authorized inside the VM
+to allow for password-less access to SSHFS. This is stored at `~/.ssh/kvm` and regenerated if it
 does not exist. It is also added to `~/.ssh/authorized_keys`. Note that removing the key file does
 not remove it from `authorized_keys`, so remove it manually if you need to. It can be easily
-identified by the comment on the key.
+identified by the comment on the key. However, removing this key is **NOT** recommended, as it
+needs to be re-generated on every script run.
 
-## Starting a Job in a VM
+## Start a Job in a VM
 
 Especially when developing a Singularity definition file it might be useful to get a shell directly
 on a VM. To do so simply run:
 
-```Bash
+```console
 startInVM --arch=power9
 ```
 
@@ -114,10 +113,11 @@ build` commands.
 
 As usual more options can be shown by running `startInVM --help`, the most important one being
 `--time`.
 
-There are 2 special use cases for this script: 1 Execute an arbitrary command inside the VM instead
-of getting a bash by appending the command to the script. Example: \<pre>startInVM --arch=power9
-singularity build \~/myContainer.sif \~/myDefinition.def\</pre> 1 Use the script in a job manually
-allocated via srun/sbatch. This will work the same as when running outside a job but will **not**
-start a new job. This is useful for using it inside batch scripts, when you already have an
-allocation or need special arguments for the job system. Again you can run an arbitrary command by
-passing it to the script.
+There are two special use cases for this script:
+
+1. Execute an arbitrary command inside the VM instead of getting a bash by appending the command to
+   the script. Example: `startInVM --arch=power9 singularity build ~/myContainer.sif ~/myDefinition.def`
+1. Use the script in a job manually allocated via `srun`/`sbatch`. This will work the same as when
+   running outside a job but will **not** start a new job. This is useful for using it inside batch
+   scripts, when you already have an allocation or need special arguments for the job system. Again
+   you can run an arbitrary command by passing it to the script.
diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml
index d53d4adb6..4a520ac4a 100644
--- a/doc.zih.tu-dresden.de/mkdocs.yml
+++ b/doc.zih.tu-dresden.de/mkdocs.yml
@@ -31,7 +31,8 @@ nav:
     - Containers:
       - Singularity: software/containers.md
       - Singularity Recipes and Hints: software/singularity_recipe_hints.md
-      - VM tools: software/vm_tools.md
+      - Virtual Machines Tools: software/virtual_machines_tools.md
+      - Virtual Machines: software/virtual_machines.md
     - Applications:
       - Licenses: software/licenses.md
      - Computational Fluid Dynamics (CFD): software/cfd.md
@@ -54,7 +55,6 @@ nav:
      - Hyperparameter Optimization (OmniOpt): software/hyperparameter_optimization.md
      - PowerAI: software/power_ai.md
      - SCS5 Migration Hints: software/scs5_software.md
-     - Virtual Machines: software/virtual_machines.md
      - Virtual Desktops: software/virtual_desktops.md
    - Software Development and Tools:
      - Overview: software/software_development_overview.md
diff --git a/doc.zih.tu-dresden.de/wordlist.aspell b/doc.zih.tu-dresden.de/wordlist.aspell
index 410e426d8..a90a4195a 100644
--- a/doc.zih.tu-dresden.de/wordlist.aspell
+++ b/doc.zih.tu-dresden.de/wordlist.aspell
@@ -148,6 +148,7 @@ queue
 randint
 reachability
 README
+reproducibility
 RHEL
 Rmpi
 rome
@@ -180,6 +181,7 @@ SMT
 squeue
 srun
 ssd
+SSHFS
 stderr
 stdout
 SUSE
@@ -204,6 +206,7 @@ vectorization
 venv
 virtualenv
 VirtualGL
+VMs
 WebVNC
 WinSCP
 Workdir
-- 
GitLab