Commit 1e0846e1 authored by Jan Frenzel

Merge branch 'merge-preview-in-main' into 'main'

Automated merge from preview to main

See merge request !1101
parents 2f7061a8 668bf6d5
Getting high I/O-bandwidth:

- Use many processes (writing in the same file at the same time is possible)
- Use large I/O transfer blocks (see the sketch below)
- Avoid reading many small files. Use a data container, e.g.
  [ratarmount](../software/utilities.md#direct-archive-access-without-extraction-using-ratarmount)
  to bundle small files into one
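
To make the second point concrete, here is a minimal illustration of large versus small transfer
blocks; the file name and block sizes are made up for this sketch and are not part of the original
list:

```console
# Illustrative only: one sequential read issued in large 4 MiB blocks ...
marie@compute$ dd if=my_large_file of=/dev/null bs=4M
# ... versus the same data read in small 4 KiB blocks, which issues far more I/O requests
marie@compute$ dd if=my_large_file of=/dev/null bs=4K
```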
[...]

Rapidgzip is still in development, so if it crashes or if it is slower than the system `gzip`,
please [open an issue](https://github.com/mxmlnkn/rapidgzip/issues) on GitHub.

### Direct Archive Access Without Extraction Using Ratarmount

In some cases of archives with millions of small files, it might not be feasible to extract the
whole archive to a filesystem.

[...]
Furthermore, the analysis results of the archive will be stored in a sidecar file alongside the
archive or in your home directory if the archive is in a non-writable location.
Subsequent mounts instantly load that sidecar file instead of reanalyzing the archive.
You will find further information on the [Ratarmount GitHub page](https://github.com/mxmlnkn/ratarmount).
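
For illustration, assume an archive named `dataset.tar` in a writable directory; the session below
is only a sketch of this behavior, not captured output:

```console
# First mount: ratarmount analyzes the archive and writes a sidecar index next to it
marie@compute$ ratarmount dataset.tar mountpoint
marie@compute$ ls -a
.  ..  dataset.tar  .dataset.tar.index.sqlite  mountpoint

# Unmount and mount again: the existing index is reused and mounting is nearly instant
marie@compute$ ratarmount -u mountpoint
marie@compute$ ratarmount dataset.tar mountpoint
```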

#### Example Workflow

The software Ratarmount is installed system-wide on the HPC system.

The first step is to create a tar archive that bundles your small files into a single file.

```bash
# On your local machine
marie@local$ tar cf dataset.tar folder_containing_my_small_files

# If your small files are already on the HPC system
marie@login$ dttar cf dataset.tar folder_containing_my_small_files
```

For the latter, please make sure that you are on a [Datamover node](../data_transfer/datamover.md)
and **not** on a login node.
Depending on the number of files, the tar bundle process may take some time.

We do not recommend compressing the archive (e.g. with gzip), as this can substantially decrease
the read performance, e.g. for images, audio and video files.
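
As an illustration of this trade-off, using the same hypothetical file names as in the block above:

```bash
# Plain tar: larger on disk, but fast random access once mounted with ratarmount
marie@local$ tar cf dataset.tar folder_containing_my_small_files

# Gzip-compressed tar: smaller, but reads have to decompress data first, which can
# hurt performance for already-compressed images, audio and video files
marie@local$ tar czf dataset.tar.gz folder_containing_my_small_files
```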

Once the tar archive has been created, you can mount it on the compute node using `ratarmount`.
All files in the mount points can be accessed as normal files or directories
in the filesystem without any special treatment.
Note that the tar archive must be mounted on every compute node in your job.

!!! note

    Mounting an archive for the first time can take some time because Ratarmount has to create
    an index of its contents to access it efficiently.
    The index, named `.<name_of_the_archive>.index.sqlite`, will be placed
    in the same directory as the archive if the directory is writable,
    otherwise ratarmount will try to place the index in your home directory.
    This indexing step could be done in a separate job to save resources.
    It also prevents conflicting indexing by more than one process at the same time.

    ```bash
    # create index
    sbatch --ntasks=1 --mem=10G --time=5:00:00 ratarmount dataset.tar
    ```

!!! example "Example job script using Ratarmount"

    ```bash
    #!/bin/bash

    #SBATCH --ntasks=3
    #SBATCH --nodes=2
    #SBATCH --time=00:05:00

    # mount the dataset on every node one time
    DATASET=/tmp/${SLURM_JOB_ID}
    srun --ntasks-per-node=1 mkdir ${DATASET}
    srun --ntasks-per-node=1 ratarmount dataset.tar ${DATASET}

    # now it can be accessed like a normal directory
    srun --ntasks=1 ls ${DATASET}

    # start the application
    srun ./my_application --input-directory ${DATASET}

    # unmount it after all work is done
    srun --ntasks-per-node=1 ratarmount -u ${DATASET}
    ```

!!! hint

    If you are starting many processes per node, Ratarmount could benefit from
    having individual mount points for each process, rather than just one per node.
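
A minimal sketch of what such a per-process layout could look like, reusing the `dataset.tar` and
`my_application` placeholders from the example above; the task count, base path, and variable names
are illustrative, and it assumes the index has already been created:

```bash
#!/bin/bash

#SBATCH --ntasks=4
#SBATCH --nodes=2
#SBATCH --time=00:05:00

# Illustrative only: give every task its own private mount point,
# derived from its Slurm task ID (SLURM_PROCID).
export DATASET_BASE=/tmp/${SLURM_JOB_ID}

srun bash -c '
    MOUNT_DIR="${DATASET_BASE}/task_${SLURM_PROCID}"
    mkdir -p "${MOUNT_DIR}"
    ratarmount dataset.tar "${MOUNT_DIR}"
    ./my_application --input-directory "${MOUNT_DIR}"
    ratarmount -u "${MOUNT_DIR}"
'
```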

In case of Ratarmount issues,
please [open an issue](https://github.com/mxmlnkn/ratarmount/issues) on GitHub.

There also is a library interface called
[...]