- May 01, 2021
-
Albert Gil authored
-
Albert Gil authored
Use test_name to avoid test_id. Use get_config_parameter to simplify checking requirements. Use submit_job to simplify code, avoiding unnecessary spawn/expect. Use wait_for and a new auxiliary function to avoid sleeps. Use subtest/subpass to avoid exit_code. Use tolerance to simplify subtests. Bug 10315. Signed-off-by: Scott Jackson <scottmo@schedmd.com>
-
Albert Gil authored
Bug 10315. Signed-off-by: Scott Jackson <scottmo@schedmd.com>
-
Scott Jackson authored
By creating the file to read as a setup step instead of in one of the tasks. Bug 10315
-
Albert Gil authored
-
Scott Jackson authored
Bug 10604
-
- Apr 30, 2021
-
John Thiltges authored
print_job_select() included an unnecessary argument in a printf() call, causing a 'Redundant argument in printf' error to be shown. Bug 11491.
-
- Apr 29, 2021
-
Ben Roberts authored
For salloc, sbatch and srun. Bug 11461. Signed-off-by: Tim Wickberg <tim@schedmd.com>
-
Ben Roberts authored
Bug 11430. Signed-off-by: Tim Wickberg <tim@schedmd.com>
-
Ben Roberts authored
Bug 11206. Signed-off-by: Tim Wickberg <tim@schedmd.com>
-
Scott Jackson authored
-
Danny Auble authored
-
Marcin Stolarek authored
This looks like a typo in the initial commit b71efa62: the lock should be released before return, not locked again. Bug 11480
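As a generic illustration of the corrected pattern (plain pthreads, not the actual Slurm code or the real function involved), the mutex is released on the way out rather than taken a second time:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static int do_work(void)
    {
        int rc = 0;

        pthread_mutex_lock(&lock);
        /* ... critical section ... */
        pthread_mutex_unlock(&lock);   /* release before return, */
        return rc;                     /* not pthread_mutex_lock() again */
    }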
-
Marshall Garey authored
In a multi-node job it was possible to be in a situation where there were more CPUs available for steps to use, but steps would not launch. For example, if a node has 2 cores and 1 thread per core and this job is submitted:

    sbatch -N2 --ntasks-per-node=2 --mem=1000 job.bash

and job.bash contains the following:

    for i in {1..4}
    do
        srun --exact --mem=100 -N1 -c1 -n1 sleep 60 &
    done
    wait

then two steps would run on the first node and one step would run on the second node, but the fourth step would not run until the first step completed, even though there is an available task and CPU on the second node in the allocation.

Why does this happen? If the step requests CPUs <= number of nodes, then when _pick_step_nodes() calls _pick_step_nodes_cpus():

    node_tmp = _pick_step_nodes_cpus(job_ptr, nodes_avail, nodes_needed,
                                     cpus_needed, usable_cpu_cnt);

it will simply return the first N nodes from the nodes_avail bitmap, where N is the number of nodes that the step requested. In this example job, all the CPUs on the first node are allocated, but the first node remains in the nodes_avail bitmap. _pick_step_nodes_cpus() therefore selects the first node and adds it to the nodes_picked bitmap. Right after that, _pick_step_nodes() gets the number of CPUs from the nodes in the nodes_picked bitmap, which is 0 CPUs.

The fix is to remove fully allocated nodes from the nodes_avail bitmap. But this alone creates another problem: once all the nodes are fully allocated and another valid step request comes in, the incorrect error ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE would be returned when the correct error is ESLURM_NODES_BUSY. So we also increment job_blocked_nodes when a node has no available CPUs. Bug 11357
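A minimal sketch of the fix as described above; bit_test() and bit_clear() are Slurm's bitstring helpers, but the loop, variable names and placement are illustrative rather than the actual _pick_step_nodes() code:

    /* Drop fully allocated nodes from nodes_avail so _pick_step_nodes_cpus()
     * cannot hand back a node with 0 usable CPUs, and count them as blocked
     * so an otherwise valid request fails with ESLURM_NODES_BUSY instead of
     * ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE. */
    for (int i = 0; i < node_record_count; i++) {
        if (!bit_test(nodes_avail, i))
            continue;
        if (usable_cpu_cnt[i] == 0) {
            bit_clear(nodes_avail, i);  /* node has no free CPUs left */
            job_blocked_nodes++;        /* remember why it was skipped */
        }
    }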
-
Carlos Tripiana Montes authored
Continuation of 8475ae9d. CID 221511. Bug 11401
-
Albert Gil authored
-
Scott Jackson authored
Fix job submission count message. Bug 10439
-
Scott Jackson authored
Bug 10439
-
- Apr 28, 2021
-
Marcin Stolarek authored
Bug 11059
-
Ben Roberts authored
Add note in cgroup.conf to match the note in slurm.conf. Bug 11209. Signed-off-by: Tim Wickberg <tim@schedmd.com>
-
Carlos Tripiana Montes authored
Jobs requesting resources that could fit in 1 leaf switch are incorrectly spread across switches. Fixing this code also makes "--switches" work again. select/cons_res already works according to documentation. Bug 11401
-
Michael Hinton authored
Bug 10944
-
Ben Roberts authored
Fix extra newlines that were causing links to be created where they weren't needed. Remove an extra .TP that prevented a link from being added. Correctly close a bold tag that prevented a link. Bug 10944
-
Ben Roberts authored
Extra .LP tags and other white space created larger than normal gaps between certain paragraphs. Bug 10944
-
Tim Wickberg authored
-
Ben Roberts authored
Fix items with descriptions that started on the same line. Address cases where some descriptions were indented more than the rest of the descriptions in the list. Bug 10944
-
Ben Roberts authored
Fixed for salloc, sbatch and srun. Bug 10944
-
Ben Roberts authored
Bug 10944
-
- Apr 27, 2021
-
Tim McMullan authored
Send a 0 as the file length over the pipe to the parent to break out of the safe_read(), rather than hanging indefinitely since the child has already exited. Bug 11460
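A self-contained sketch of the pipe protocol this describes (an assumed shape, not the slurmd implementation): on its error path the child writes a zero length so the parent's blocking read returns instead of waiting for file data that will never arrive.

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int pipefd[2];
        uint32_t len;

        if (pipe(pipefd) < 0)
            return 1;

        pid_t pid = fork();
        if (pid < 0)
            return 1;
        if (pid == 0) {                  /* child: error path */
            close(pipefd[0]);
            len = 0;                     /* nothing to send: report length 0 */
            (void) write(pipefd[1], &len, sizeof(len));
            close(pipefd[1]);
            _exit(1);
        }

        /* parent */
        close(pipefd[1]);
        if (read(pipefd[0], &len, sizeof(len)) == sizeof(len) && len == 0)
            printf("child reported no data, not waiting for file contents\n");
        close(pipefd[0]);
        waitpid(pid, NULL, 0);
        return 0;
    }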
-
Tim Wickberg authored
-
Tim Wickberg authored
Update slurm.spec as well.
-
- Apr 26, 2021
-
Ben Roberts authored
Bug 11318
-
- Apr 23, 2021
-
Brian Christiansen authored
This reverts commit 69d5d94b. submit_dir != working dir. In 7092, the customer was seeing the interactive srun set SLURM_SUBMIT_DIR to the cwd after -D had been applied, which is wrong behavior. This will be fixed in 21.08. Bug 11407
-
Ben Roberts authored
Bug 11447
-
Marcin Stolarek authored
Bug 11396
-
- Apr 21, 2021
-
Ben Roberts authored
Bug 11363. Signed-off-by: Tim Wickberg <tim@schedmd.com>
-
Danny Auble authored
Bug 11093
-
Carlos Tripiana Montes authored
It can happen that the child process fails at some point and returns an exit code different from 0, but the parent didn't check this properly and finished returning without error. Now, if the child fails, the parent fails too. This propagates the error back, so we know the job will have a problem. Bug 11093
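An illustrative pattern for the check described here (generic POSIX; _wait_child_rc() is a hypothetical helper name, not the actual plugin code):

    #include <sys/types.h>
    #include <sys/wait.h>

    /* Return 0 only if the child exited normally with status 0; otherwise
     * report an error so the caller also fails instead of hiding the
     * child's problem. */
    static int _wait_child_rc(pid_t cpid)
    {
        int status = 0;

        if (waitpid(cpid, &status, 0) < 0)
            return -1;                      /* wait itself failed */
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            return -1;                      /* propagate child failure */
        return 0;                           /* child succeeded */
    }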
-
Carlos Tripiana Montes authored
container_p_restore now gets the list of running jobs from the spool dir with stepd_available. Then it iterates over the basepath entries and, for those which seem to have been a mount point (have a .ns file), tries to mount them again. If that succeeds (it must), and if the job for that mount point is dead, it releases resources and tries to delete the files. Remember that the removal can fail if a resource is leaked. These leaks would be fixed if slurmd starts after a HW reboot (no kernel leaks). Bug 11093
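A rough sketch of the restore walk described above; _remount_ns(), _job_is_dead() and _delete_ns() below are placeholders for the plugin's internal helpers (stubbed out here), not real Slurm signatures.

    #include <dirent.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Placeholder helpers standing in for the plugin internals. */
    static int  _remount_ns(const char *ns_file)  { (void) ns_file;  return 0; }
    static int  _job_is_dead(const char *job_dir) { (void) job_dir;  return 1; }
    static void _delete_ns(const char *job_dir)   { (void) job_dir; }

    static void _restore_basepath(const char *basepath)
    {
        DIR *dp = opendir(basepath);
        struct dirent *de;
        char ns_file[4096];

        if (!dp)
            return;
        while ((de = readdir(dp))) {
            if (de->d_name[0] == '.')
                continue;                 /* skip "." and ".." entries */
            snprintf(ns_file, sizeof(ns_file), "%s/%s/.ns",
                     basepath, de->d_name);
            if (access(ns_file, F_OK))
                continue;                 /* never was a mount point */
            if (_remount_ns(ns_file))
                continue;                 /* remount must succeed */
            if (_job_is_dead(de->d_name))
                _delete_ns(de->d_name);   /* release resources; may fail if leaked */
        }
        closedir(dp);
    }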
-
Carlos Tripiana Montes authored
Then, make the _rm_data call in _create_ns keep the old behavior: there the data will only be removed if something fails during creation, and no previous NS leak is possible, so force removal or fail. But _delete_ns could be called at the job's end after slurmd got killed and restarted, thus having leaked the NS. Even though slurmd can recreate the NS and mount it in the same place, the .ns file can't be removed at the end of the job because of EBUSY. Bug 11093
-