Commit ddade55e authored by Jan Frenzel's avatar Jan Frenzel
Browse files

Merge branch 'issue-362' into 'preview'

Add sections about ratarmount and pragzip

Closes #362

See merge request !649
parents 56e6f624 8b3ab9fe
......@@ -123,3 +123,106 @@ failed to connect to server
marie@login3$ ssh login4 tmux ls
marie_is_testing: 1 windows (created Tue Mar 29 19:06:26 2022) [105x32]
```
## Working with Large Archives and Compressed Files
### Parallel Gzip Decompression
There is a plethora of `gzip` tools but none of them can fully utilize multiple cores.
The fastest single-core decoder is `igzip` from the
[Intelligent Storage Acceleration Library](https://github.com/intel/isa-l.git).
In tests, it can reach ~500 MB/s compared to ~200 MB/s for the system-default `gzip`.
If you have very large files and need to decompress them even faster, you can use
[pragzip](https://github.com/mxmlnkn/pragzip).
Currently, it can reach ~1.5 GB/s using a 12-core processor in the above-mentioned tests.
[Pragzip](https://github.com/mxmlnkn/pragzip) is available on PyPI and can be installed via pip.
It is recommended to install it inside a
[Python virtual environment](python_virtual_environments.md).
```console
marie@compute$ pip install pragzip
```
It can also be installed from its C++ source code.
If you prefer that over the version on PyPI, then you can build it like this:
```console
marie@compute$ git clone https://github.com/mxmlnkn/pragzip.git
marie@compute$ cd pragzip
marie@compute$ mkdir build
marie@compute$ cd build
marie@compute$ cmake ..
marie@compute$ cmake --build . pragzip
marie@compute$ src/tools/pragzip --help
```
The built binary can then be used directly or copied inside a folder that is available in your
`PATH` environment variable.
Pragzip can be used like this:
```bash
marie@compute$ pragzip -d <file_to_decompress>
```
For example, if you want to decompress a file called `data.gz`, use:
```console
marie@compute$ pragzip -d data.gz
```
Furthermore, you can use it to speed up extraction of a file `my-archive.tar.gz` like this:
```console
marie@compute$ tar --use-compress-program=pragzip -xf my-archive.tar.gz
```
Pragzip is still in development, so if it crashes or if it is slower than the system `gzip`,
please [open an issue](https://github.com/mxmlnkn/pragzip/issues) on GitHub.
### Direct Archive Access Without Extraction
In some cases of archives with millions of small files, it might not be feasible to extract the
whole archive to a filesystem.
The known `archivemount` tool has performance problems with such archives even if they are simply
uncompressed TAR files.
Furthermore, with `archivemount` the archive would have to be reanalyzed whenever a new job is started.
`Ratarmount` is an alternative that solves these performance issues.
The archive will be analyzed and then can be accessed via a FUSE mountpoint showing the internal
folder hierarchy.
Access to files is consistently fast no matter the archive size while `archivemount` might take
minutes per file access.
Furthermore, the analysis results of the archive will be stored in a sidecar file alongside the
archive or in your home directory if the archive is in a non-writable location.
Subsequent mounts instantly load that sidecar file instead of reanalyzing the archive.
[Ratarmount](https://github.com/mxmlnkn/ratarmount) is available on PyPI and can be installed via pip.
It is recommended to install it inside a [Python virtual environment](python_virtual_environments.md).
```console
marie@compute$ pip install ratarmount
```
After that, you can use ratarmount to mount a TAR file using the following approach:
```bash
marie@compute$ ratarmount <compressed_file> <mountpoint>
```
Thus, you could invoke ratarmount as follows:
```console
marie@compute$ ratarmount inputdata.tar.gz input-folder
# Now access the data as if it was a directory, e.g.:
marie@compute$ cat input-folder/input-file1
```
Ratarmount is still in development, so if there are problems or if it is unexpectedly slow,
please [open an issue](https://github.com/mxmlnkn/ratarmount/issues) on GitHub.
There also is a library interface called
[ratarmountcore](https://github.com/mxmlnkn/ratarmount/tree/master/core#example) that works
fully without FUSE, which might make access to files from Python even faster.
personal_ws-1.1 en 464
personal_ws-1.1 en 471
Abaqus
ACL
ACLs
......@@ -129,6 +129,7 @@ GPUs
gres
GROMACS
GUIs
Gzip
Hackenberg
hadoop
haswell
......@@ -306,6 +307,8 @@ png
PowerAI
PowerShell
ppc
pragzip
Pragzip
pre
Pre
preload
......@@ -323,6 +326,7 @@ Pthreads
pty
PuTTY
pymdownx
PyPI
PythonAnaconda
pytorch
PyTorch
......@@ -331,6 +335,9 @@ queue
quickstart
Quickstart
randint
ratarmount
Ratarmount
ratarmountcore
reachability
README
Rechenautomat
......
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment