Commit 3c07f39f authored 3 years ago by LocNgu
review checkpoint_restart.md
check spelling, formatting, links, lint
parent 98cf0684
3 merge requests: !322 Merge preview into main, !319 Merge preview into main, !281 review checkpoint_restart.md
doc.zih.tu-dresden.de/docs/jobs_and_resources/checkpoint_restart.md
39 additions, 36 deletions
@@ -11,17 +11,16 @@ Espresso, STAR-CCM+, VASP
 In case your program does not natively support checkpointing, there are
 attempts at creating generic checkpoint/restart solutions that should
 work application-agnostic. One such project which we recommend is DMTCP:
-Distributed MultiThreaded CheckPointing
-(<http://dmtcp.sourceforge.net>).
+[Distributed MultiThreaded CheckPointing](<http://dmtcp.sourceforge.net>).
 
 It is available on ZIH systems after having loaded the "dmtcp" module:
 
 ```console
-module load DMTCP
+marie@login$ module load DMTCP
 ```
 
 While our batch system of choice, Slurm, also provides a checkpointing
-interface to the user, unfortunately it does not yet support DMTCP at
+interface to the user, unfortunately, it does not yet support DMTCP at
 this time. However, there are ongoing efforts of writing a Slurm plugin
 that hopefully will change this in the near future. We will update this
 documentation as soon as it becomes available.
@@ -51,21 +50,22 @@ parameters `--ib --rm` and put it between srun and your application
 call, e.g.:
 
 ```console
-srun dmtcp_launch --ib --rm ./my-mpi-application
+marie@login$ srun dmtcp_launch --ib --rm ./my-mpi-application
 ```
 
-`Note:` we have successfully tested checkpointing MPI applications with
-the latest `Intel MPI` (module: intelmpi/2018.0.128). While it might
-work with other MPI libraries, too, we have no experience in this
-regard, so you should always try it out before using it for your
-productive jobs.
+!!! note
+
+    We have successfully tested checkpointing MPI applications with
+    the latest `Intel MPI` (module: intelmpi/2018.0.128). While it might
+    work with other MPI libraries, too, we have no experience in this
+    regard, so you should always try it out before using it for your
+    productive jobs.
 
 Then just substitute your usual `sbatch` call with `dmtcp_sbatch` and be
 sure to specify the `-t` and `-i` parameters (don't forget you need to
 have loaded the dmtcp module).
 
 ```console
-dmtcp_sbatch -t 2-00:00:00 -i 28000,800 my_batchfile.sh
+marie@login$ dmtcp_sbatch --time 2-00:00:00 --interval 28000,800 my_batchfile.sh
 ```
 
 With `-t|--time` you set the total runtime of your calculation overall
@@ -80,9 +80,9 @@ out the checkpoint files, separated from the interval time via comma
 In the above example, there will be 6 jobs each running 8 hours, so
 about 2 days in total.
 
-Hints:
+!!! Hints
 
--   If you see your first job running into the timelimit, that probably
+    - If you see your first job running into the timelimit, that probably
     means the timeout for writing out checkpoint files does not suffice
     and should be increased. Our tests have shown that it takes
     approximately 5 minutes to write out the memory content of a fully
@@ -91,14 +91,13 @@ Hints:
     depending on how much memory your application uses. If your memory
     content is rather incompressible, it might be a good idea to disable
     the checkpoint file compression by setting: `export DMTCP_GZIP=0`
--   Note that all jobs the script deems necessary for your chosen
+    - Note that all jobs the script deems necessary for your chosen
     timelimit/interval values are submitted right when first calling the
     script. If your applications take considerably less time than what
     you specified, some of the individual jobs will be unnecessary. As
     soon as one job does not find a checkpoint to resume from, it will
     cancel all subsequent jobs for you.
--   See `dmtcp_sbatch -h` for a list of available parameters and more
-    help
+    - See `dmtcp_sbatch -h` for a list of available parameters and more help
 
 What happens in your work directory?
 
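
Review note on the `--time 2-00:00:00 --interval 28000,800` example: the statement "6 jobs each running 8 hours" can be checked with a little shell arithmetic. The sketch below assumes that each chained job runs for the checkpoint interval plus the checkpoint write timeout; how `dmtcp_sbatch` actually derives the job count is not shown in this diff, so treat the formula and the variable names as illustrative assumptions.

```shell
#!/bin/bash
# Values taken from the documented call:
#   dmtcp_sbatch --time 2-00:00:00 --interval 28000,800 my_batchfile.sh
total_seconds=$(( 2 * 24 * 3600 ))   # --time 2-00:00:00, i.e. 2 days
interval=28000                       # seconds between checkpoints
ckpt_timeout=800                     # seconds allowed for writing a checkpoint

# Assumption: one chained job lasts interval + timeout seconds.
per_job=$(( interval + ckpt_timeout ))

# Ceiling division, so a partial last segment still gets its own job.
n_jobs=$(( (total_seconds + per_job - 1) / per_job ))

echo "per-job runtime: $(( per_job / 3600 )) h, chained jobs: ${n_jobs}"
```

With these numbers the per-job runtime is 28800 seconds (8 hours) and the chain has 6 jobs, which agrees with the documentation text.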
@@ -140,18 +139,20 @@ can be useful if you wish to implement some sort of job chaining on your
 own. 1 In front of your program call, you have to add the wrapper
 script: `dmtcp_launch` **TODO check**
 
-```bash
-#/bin/bash
-#SBATCH --time=00:01:00
-#SBATCH --cpus-per-task=8
-#SBATCH --mem-per-cpu=1500
-
-source $DMTCP_ROOT/bin/bash start_coordinator -i 40 --exit-after-ckpt
-
-dmtcp_launch ./my-application #for sequential/multithreaded applications
-#or: srun dmtcp_launch --ib --rm ./my-mpi-application #for MPI
-applications
-```
+???+ example
+
+    ```bash
+    #/bin/bash
+    #SBATCH --time=00:01:00
+    #SBATCH --cpus-per-task=8
+    #SBATCH --mem-per-cpu=1500
+
+    source $DMTCP_ROOT/bin/bash start_coordinator -i 40 --exit-after-ckpt
+
+    dmtcp_launch ./my-application #for sequential/multithreaded applications
+    #or: srun dmtcp_launch --ib --rm ./my-mpi-application #for MPI
+    applications
+    ```
 
 This will create a checkpoint automatically after 40 seconds and then
 terminate your application and with it the job. If the job runs into its
@@ -166,16 +167,18 @@ original job. If you do not wish to create another checkpoint in your
 restarted run again, you can omit the `-i` and `--exit-after-ckpt`
 parameters this time. Afterwards, the application must be run using the
 restart script, specifying the host and port of the coordinator (they
-have been exported by the start_coordinator function). Example:
+have been exported by the start_coordinator function).
 
-```bash
-#/bin/bash
-#SBATCH --time=00:01:00
-#SBATCH --cpus-per-task=8
-#SBATCH --mem-per-cpu=1500
-
-source $DMTCP_ROOT/bin/bash start_coordinator -i 40 --exit-after-ckpt
-
-./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p
-$DMTCP_COORD_PORT
-```
+???+ example
+
+    ```bash
+    #/bin/bash
+    #SBATCH --time=00:01:00
+    #SBATCH --cpus-per-task=8
+    #SBATCH --mem-per-cpu=1500
+
+    source $DMTCP_ROOT/bin/bash start_coordinator -i 40 --exit-after-ckpt
+
+    ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p
+    $DMTCP_COORD_PORT
+    ```
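
Review note on the two example job scripts: for do-it-yourself job chaining, the choice between a first launch and a restart could be folded into a single script. The following is a minimal sketch, not part of DMTCP or the ZIH tooling. The helper name `next_dmtcp_command` is hypothetical, and it assumes that the presence of `dmtcp_restart_script.sh` in the work directory signals an existing checkpoint. It only prints the command, so it can be tried without DMTCP installed. Note also that the scripts in this diff start with `#/bin/bash`, which the shell treats as a plain comment; a shebang needs `#!`.

```shell
#!/bin/bash
# Hypothetical helper: print the command a chained job should run next.
next_dmtcp_command() {
    local workdir="$1"   # directory where checkpoint files are written
    local app="$2"       # application to launch on the very first run
    if [ -f "${workdir}/dmtcp_restart_script.sh" ]; then
        # Assumed checkpoint present: resume through the restart script,
        # passing the coordinator host and port exported by start_coordinator.
        echo './dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT'
    else
        # No checkpoint yet: start the application under DMTCP control.
        echo "dmtcp_launch ${app}"
    fi
}

# Demonstration in a scratch directory.
demo_dir="$(mktemp -d)"
first_run="$(next_dmtcp_command "${demo_dir}" ./my-application)"
touch "${demo_dir}/dmtcp_restart_script.sh"   # simulate a written checkpoint
second_run="$(next_dmtcp_command "${demo_dir}" ./my-application)"
echo "first run:  ${first_run}"
echo "second run: ${second_run}"
rm -rf "${demo_dir}"
```

In a real job script the printed command would be executed after sourcing the coordinator setup, and MPI applications would use the `srun dmtcp_launch --ib --rm` form from the documentation instead.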