Commit 1e0846e1 authored by Jan Frenzel

Merge branch 'merge-preview-in-main' into 'main'

Automated merge from preview to main

See merge request !1101
parents 2f7061a8 668bf6d5
Getting high I/O-bandwidth:

- Use many processes (writing in the same file at the same time is possible)
- Use large I/O transfer blocks (see the sketch below)
- Avoid reading many small files. Use a data container, e.g.
  [ratarmount](../software/utilities.md#direct-archive-access-without-extraction-using-ratarmount)
  to bundle small files into one
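
To make the second point concrete, here is a minimal illustration of large versus small transfer
blocks; the file name and block sizes are made up for this sketch and are not part of the original
list:

```console
# Illustrative only: one sequential read issued in large 4 MiB blocks ...
marie@compute$ dd if=my_large_file of=/dev/null bs=4M
# ... versus the same data read in small 4 KiB blocks, which issues far more I/O requests
marie@compute$ dd if=my_large_file of=/dev/null bs=4K
```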
[...]

Rapidgzip is still in development, so if it crashes or if it is slower than the system `gzip`,
please [open an issue](https://github.com/mxmlnkn/rapidgzip/issues) on GitHub.

### Direct Archive Access Without Extraction Using Ratarmount

In some cases of archives with millions of small files, it might not be feasible to extract the
whole archive to a filesystem.

[...]
Furthermore, the analysis results of the archive will be stored in a sidecar file alongside the
archive or in your home directory if the archive is in a non-writable location.
Subsequent mounts instantly load that sidecar file instead of reanalyzing the archive.
You will find further information on the [Ratarmount GitHub page](https://github.com/mxmlnkn/ratarmount).
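
For illustration, assume an archive named `dataset.tar` in a writable directory; the session below
is only a sketch of this behavior, not captured output:

```console
# First mount: ratarmount analyzes the archive and writes a sidecar index next to it
marie@compute$ ratarmount dataset.tar mountpoint
marie@compute$ ls -a
.  ..  dataset.tar  .dataset.tar.index.sqlite  mountpoint

# Unmount and mount again: the existing index is reused and mounting is nearly instant
marie@compute$ ratarmount -u mountpoint
marie@compute$ ratarmount dataset.tar mountpoint
```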

#### Example Workflow

The software Ratarmount is installed system-wide on the HPC system.

The first step is to create a tar archive that bundles your small files into a single file.

```bash
# On your local machine
marie@local$ tar cf dataset.tar folder_containing_my_small_files

# If your small files are already on the HPC system
marie@login$ dttar cf dataset.tar folder_containing_my_small_files
```

For the latter, please make sure that you are on a [Datamover node](../data_transfer/datamover.md)
and **not** on a login node.
Depending on the number of files, the tar bundle process may take some time.

We do not recommend compressing the archive (e.g. with gzip), as this can substantially decrease
the read performance, e.g. for images, audio and video files.
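
As an illustration of this trade-off, using the same hypothetical file names as in the block above:

```bash
# Plain tar: larger on disk, but fast random access once mounted with ratarmount
marie@local$ tar cf dataset.tar folder_containing_my_small_files

# Gzip-compressed tar: smaller, but reads have to decompress data first, which can
# hurt performance for already-compressed images, audio and video files
marie@local$ tar czf dataset.tar.gz folder_containing_my_small_files
```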

Once the tar archive has been created, you can mount it on the compute node using `ratarmount`.
All files in the mount points can be accessed as normal files or directories
in the filesystem without any special treatment.
Note that the tar archive must be mounted on every compute node in your job.

!!! note

    Mounting an archive for the first time can take some time because Ratarmount has to create
    an index of its contents to access it efficiently.
    The index, named `.<name_of_the_archive>.index.sqlite`, will be placed
    in the same directory as the archive if the directory is writable,
    otherwise ratarmount will try to place the index in your home directory.
    This indexing step could be done in a separate job to save resources.
    It also prevents conflicting indexing by more than one process at the same time.

    ```bash
    # create index
    sbatch --ntasks=1 --mem=10G --time=5:00:00 ratarmount dataset.tar
    ```

!!! example "Example job script using Ratarmount"

    ```bash
    #!/bin/bash

    #SBATCH --ntasks=3
    #SBATCH --nodes=2
    #SBATCH --time=00:05:00

    # mount the dataset on every node one time
    DATASET=/tmp/${SLURM_JOB_ID}
    srun --ntasks-per-node=1 mkdir ${DATASET}
    srun --ntasks-per-node=1 ratarmount dataset.tar ${DATASET}

    # now it can be accessed like a normal directory
    srun --ntasks=1 ls ${DATASET}

    # start the application
    srun ./my_application --input-directory ${DATASET}

    # unmount it after all work is done
    srun --ntasks-per-node=1 ratarmount -u ${DATASET}
    ```

!!! hint

    If you are starting many processes per node, Ratarmount could benefit from
    having individual mount points for each process, rather than just one per node.
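
A minimal sketch of what such a per-process layout could look like, reusing the `dataset.tar` and
`my_application` placeholders from the example above; the task count, base path, and variable names
are illustrative, and it assumes the index has already been created:

```bash
#!/bin/bash

#SBATCH --ntasks=4
#SBATCH --nodes=2
#SBATCH --time=00:05:00

# Illustrative only: give every task its own private mount point,
# derived from its Slurm task ID (SLURM_PROCID).
export DATASET_BASE=/tmp/${SLURM_JOB_ID}

srun bash -c '
    MOUNT_DIR="${DATASET_BASE}/task_${SLURM_PROCID}"
    mkdir -p "${MOUNT_DIR}"
    ratarmount dataset.tar "${MOUNT_DIR}"
    ./my_application --input-directory "${MOUNT_DIR}"
    ratarmount -u "${MOUNT_DIR}"
'
```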

In case of Ratarmount issues,
please [open an issue](https://github.com/mxmlnkn/ratarmount/issues) on GitHub.

There also is a library interface called
[...]