diff --git a/doc.zih.tu-dresden.de/docs/data_lifecycle/file_systems.md b/doc.zih.tu-dresden.de/docs/data_lifecycle/file_systems.md
index e2eed87b3fe7f28bc9de14767f6ad6ab00b0fe6f..e2a2f828f4ad4a575537957d40420f1c6dcd0fc1 100644
--- a/doc.zih.tu-dresden.de/docs/data_lifecycle/file_systems.md
+++ b/doc.zih.tu-dresden.de/docs/data_lifecycle/file_systems.md
@@ -26,7 +26,7 @@ Getting high I/O-bandwidth
 - Use many processes (writing in the same file at the same time is possible)
 - Use large I/O transfer blocks
 - Avoid reading many small files. Use data container e. g.
-  [ratarmount](../software/utilities.md#direct-archive-access-without-extraction)
+  [ratarmount](../software/utilities.md#direct-archive-access-without-extraction-using-ratarmount)
   to bundle small files into one
 
 ## Cheat Sheet for Debugging Filesystem Issues
diff --git a/doc.zih.tu-dresden.de/docs/data_lifecycle/working.md b/doc.zih.tu-dresden.de/docs/data_lifecycle/working.md
index 7588d89538f5fd24f3400d85c745aa9b6788e9db..180b425f58ee32ef06ec3c8a11db28f25c06bb58 100644
--- a/doc.zih.tu-dresden.de/docs/data_lifecycle/working.md
+++ b/doc.zih.tu-dresden.de/docs/data_lifecycle/working.md
@@ -48,8 +48,8 @@ Getting high I/O-bandwidth
 - Use many processes (writing in the same file at the same time is possible)
 - Use large I/O transfer blocks
 - Avoid reading many small files. Use data container e. g.
-  [ratarmount](../software/utilities.md#direct-archive-access-without-extraction) to bundle
-  small files into one
+  [ratarmount](../software/utilities.md#direct-archive-access-without-extraction-using-ratarmount)
+  to bundle small files into one
 
 ## Cheat Sheet for Debugging Filesystem Issues
 
diff --git a/doc.zih.tu-dresden.de/docs/software/utilities.md b/doc.zih.tu-dresden.de/docs/software/utilities.md
index 8dc71743a1bb87d772e4f687f5fb0501e4cd5a77..c74660944c602002d17ff12d9c7e9913b1f64664 100644
--- a/doc.zih.tu-dresden.de/docs/software/utilities.md
+++ b/doc.zih.tu-dresden.de/docs/software/utilities.md
@@ -222,7 +222,7 @@ marie@compute$ tar --use-compress-program=rapidgzip -xf my-archive.tar.gz
 Rapidgzip is still in development, so if it crashes or if it is slower than the system `gzip`,
 please [open an issue](https://github.com/mxmlnkn/rapidgzip/issues) on GitHub.
 
-### Direct Archive Access Without Extraction
+### Direct Archive Access Without Extraction Using Ratarmount
 
 In some cases of archives with millions of small files, it might not be feasible to extract the
 whole archive to a filesystem.
@@ -238,30 +238,79 @@ minutes per file access.
 Furthermore, the analysis results of the archive will be stored in a sidecar file alongside the
 archive or in your home directory if the archive is in a non-writable location.
 Subsequent mounts instantly load that sidecar file instead of reanalyzing the archive.
+You will find further information on the [Ratarmount GitHub page](https://github.com/mxmlnkn/ratarmount).
 
-[Ratarmount](https://github.com/mxmlnkn/ratarmount) is available on PyPI and can be installed via pip.
-It is recommended to install it inside a [Python virtual environment](python_virtual_environments.md).
+#### Example Workflow
 
-```console
-marie@compute$ pip install ratarmount
-```
+The software Ratarmount is installed system-wide on the HPC system.
 
-After that, you can use ratarmount to mount a TAR file using the following approach:
+The first step is to create a tar archive to bundle your small files in a single file.
 
 ```bash
-marie@compute$ ratarmount <compressed_file> <mountpoint>
+# On your local machine
+marie@local$ tar cf dataset.tar folder_containing_my_small_files
+
+# If your small files are already on the HPC system
+marie@login$ dttar cf dataset.tar folder_containing_my_small_files
 ```
 
-Thus, you could invoke ratarmount as follows:
+For the latter, please make sure that you are on a [Datamover node](../data_transfer/datamover.md)
+and **not** on a login node.
+Depending on the number of files, the tar bundle process may take some time.
 
-```console
-marie@compute$ ratarmount inputdata.tar.gz input-folder
+We do not recommend compressing the archive (e.g. with Gzip), as this can substantially decrease the read performance,
+e.g. for images, audio and video files.
 
-# Now access the data as if it was a directory, e.g.:
-marie@compute$ cat input-folder/input-file1
-```
+Once the tar archive has been created, you can mount it on the compute node using `ratarmount`.
+All files in the mount points can be accessed as normal files or directories
+in the filesystem without any special treatment.
+Note that the tar archive must be mounted on every compute node in your job.
+
+!!! note
+
+    Mounting an archive for the first time can take some time because Ratarmount has to create an index of its contents to access it efficiently.
+    The index, named `.<name_of_the_archive>.index.sqlite`, will be placed
+    in the same directory as the archive if the directory is writable,
+    otherwise ratarmount will try to place the index in your home directory.
+    This indexing step could be done in a separate job to save resources.
+    It also prevents conflicting indexing by more than one process at the same time.
+
+    ```bash
+    # create index
+    sbatch --ntasks=1 --mem=10G --time=5:00:00 ratarmount dataset.tar
+    ```
+
+!!! example "Example job script using Ratarmount"
+
+    ```bash
+    #!/bin/bash
+
+    #SBATCH --ntasks=3
+    #SBATCH --nodes=2
+    #SBATCH --time=00:05:00
+
+
+    # mount the dataset on every node one time
+    DATASET=/tmp/${SLURM_JOB_ID}
+    srun --ntasks-per-node=1 mkdir ${DATASET}
+    srun --ntasks-per-node=1 ratarmount dataset.tar ${DATASET}
+
+    # now it can be accessed like a normal directory
+    srun --ntasks=1 ls ${DATASET}
+
+    # start the application
+    srun ./my_application --input-directory ${DATASET}
+
+    # unmount it after all work is done
+    srun --ntasks-per-node=1 ratarmount -u ${DATASET}
+    ```
+
+!!! hint
+
+    If you are starting many processes per node, Ratarmount could benefit from
+    having individual mount points for each process, rather than just one per node.
 
-Ratarmount is still in development, so if there are problems or if it is unexpectedly slow,
+In case of Ratarmount issues
 please [open an issue](https://github.com/mxmlnkn/ratarmount/issues) on GitHub.
 
 There also is a library interface called