diff --git a/doc.zih.tu-dresden.de/docs/software/utilities.md b/doc.zih.tu-dresden.de/docs/software/utilities.md index a9494a993817a5e2c8c65ac677c2aecffa4ee6ed..4407eef48bc7cb7088cd04bd6eade888ca48f803 100644 --- a/doc.zih.tu-dresden.de/docs/software/utilities.md +++ b/doc.zih.tu-dresden.de/docs/software/utilities.md @@ -165,3 +165,106 @@ module load hwloc/2.5.0-GCCcore-11.2.0 srun lstopo ``` + +## Working with Large Archives and Compressed Files + +### Parallel Gzip Decompression + +There is a plethora of `gzip` tools but none of them can fully utilize multiple cores. +The fastest single-core decoder is `igzip` from the +[Intelligent Storage Acceleration Library](https://github.com/intel/isa-l.git). +In tests, it can reach ~500 MB/s compared to ~200 MB/s for the system-default `gzip`. +If you have very large files and need to decompress them even faster, you can use +[pragzip](https://github.com/mxmlnkn/pragzip). +Currently, it can reach ~1.5 GB/s using a 12-core processor in the above-mentioned tests. + +[Pragzip](https://github.com/mxmlnkn/pragzip) is available on PyPI and can be installed via pip. +It is recommended to install it inside a +[Python virtual environment](python_virtual_environments.md). + +```console +marie@compute$ pip install pragzip +``` + +It can also be installed from its C++ source code. +If you prefer that over the version on PyPI, then you can build it like this: + +```console +marie@compute$ git clone https://github.com/mxmlnkn/pragzip.git +marie@compute$ cd pragzip +marie@compute$ mkdir build +marie@compute$ cd build +marie@compute$ cmake .. +marie@compute$ cmake --build . pragzip +marie@compute$ src/tools/pragzip --help +``` + +The built binary can then be used directly or copied inside a folder that is available in your +`PATH` environment variable. + +Pragzip can be used like this: + +```bash +marie@compute$ pragzip -d <file_to_decompress> +``` + +For example, if you want to decompress a file called `data.gz`, use: + +```console +marie@compute$ pragzip -d data.gz +``` + +Furthermore, you can use it to speed up extraction of a file `my-archive.tar.gz` like this: + +```console +marie@compute$ tar --use-compress-program=pragzip -xf my-archive.tar.gz +``` + +Pragzip is still in development, so if it crashes or if it is slower than the system `gzip`, +please [open an issue](https://github.com/mxmlnkn/pragzip/issues) on GitHub. + +### Direct Archive Access Without Extraction + +In some cases of archives with millions of small files, it might not be feasible to extract the +whole archive to a filesystem. +The known `archivemount` tool has performance problems with such archives even if they are simply +uncompressed TAR files. +Furthermore, with `archivemount` the archive would have to be reanalyzed whenever a new job is started. + +`Ratarmount` is an alternative that solves these performance issues. +The archive will be analyzed and then can be accessed via a FUSE mountpoint showing the internal +folder hierarchy. +Access to files is consistently fast no matter the archive size while `archivemount` might take +minutes per file access. +Furthermore, the analysis results of the archive will be stored in a sidecar file alongside the +archive or in your home directory if the archive is in a non-writable location. +Subsequent mounts instantly load that sidecar file instead of reanalyzing the archive. + +[Ratarmount](https://github.com/mxmlnkn/ratarmount) is available on PyPI and can be installed via pip. +It is recommended to install it inside a [Python virtual environment](python_virtual_environments.md). + +```console +marie@compute$ pip install ratarmount +``` + +After that, you can use ratarmount to mount a TAR file using the following approach: + +```bash +marie@compute$ ratarmount <compressed_file> <mountpoint> +``` + +Thus, you could invoke ratarmount as follows: + +```console +marie@compute$ ratarmount inputdata.tar.gz input-folder + +# Now access the data as if it was a directory, e.g.: +marie@compute$ cat input-folder/input-file1 +``` + +Ratarmount is still in development, so if there are problems or if it is unexpectedly slow, +please [open an issue](https://github.com/mxmlnkn/ratarmount/issues) on GitHub. + +There also is a library interface called +[ratarmountcore](https://github.com/mxmlnkn/ratarmount/tree/master/core#example) that works +fully without FUSE, which might make access to files from Python even faster. diff --git a/doc.zih.tu-dresden.de/mkdocs.yml b/doc.zih.tu-dresden.de/mkdocs.yml index fbff6c65f63d62e4abdae0f243b6d7ecefbe84e6..34e6df3346bb503aadd92b07ad46565b851a7912 100644 --- a/doc.zih.tu-dresden.de/mkdocs.yml +++ b/doc.zih.tu-dresden.de/mkdocs.yml @@ -106,8 +106,7 @@ nav: - Checkpoint/Restart: jobs_and_resources/checkpoint_restart.md - Job Profiling: jobs_and_resources/slurm_profiling.md - Binding And Distribution Of Tasks: jobs_and_resources/binding_and_distribution_of_tasks.md - - Support: - - How to Ask for Support: support/support.md + - User Support: support/support.md - Archive: - Overview: archive/overview.md - Bio Informatics: archive/bioinformatics.md @@ -133,7 +132,7 @@ nav: - UNICORE Rest API: archive/unicore_rest_api.md - VampirTrace: archive/vampirtrace.md - Windows Batchjobs: archive/windows_batch.md - - Contribute: + - Contribute to Documentation: - How-To Contribute: contrib/howto_contribute.md - Content Rules: contrib/content_rules.md - Contribute via Browser: contrib/contribute_browser.md diff --git a/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css b/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css index 6b855870a148cc8a08d731561c5ee82d1d527695..f5664bf72c87decc408c4369fba768e4a010bc3e 100644 --- a/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css +++ b/doc.zih.tu-dresden.de/tud_theme/stylesheets/extra.css @@ -246,3 +246,9 @@ p { .highlight .gp { user-select: none; } + +.md-typeset table:not([class]) th a{ + line-height: 170%; + color: var(--tud-red-90); +} + diff --git a/doc.zih.tu-dresden.de/wordlist.aspell b/doc.zih.tu-dresden.de/wordlist.aspell index c1ce446898517f2bee881f568e7f51aff9fe41f1..5968895cac41fb9d400abe91bb0c967bb5d2483c 100644 --- a/doc.zih.tu-dresden.de/wordlist.aspell +++ b/doc.zih.tu-dresden.de/wordlist.aspell @@ -1,4 +1,4 @@ -personal_ws-1.1 en 464 +personal_ws-1.1 en 471 Abaqus ACL ACLs @@ -129,6 +129,7 @@ GPUs gres GROMACS GUIs +Gzip Hackenberg hadoop haswell @@ -307,6 +308,8 @@ png PowerAI PowerShell ppc +pragzip +Pragzip pre Pre preload @@ -324,6 +327,7 @@ Pthreads pty PuTTY pymdownx +PyPI PythonAnaconda pytorch PyTorch @@ -332,6 +336,9 @@ queue quickstart Quickstart randint +ratarmount +Ratarmount +ratarmountcore reachability README Rechenautomat