
Provide a unified SLURM jobscript

Katharina Höflich requested to merge provide-unified-slurm-jobscript into master

This MR implements features for the hlrn-goettingen-jupyterlab.sh job script that were only available in the nesh-linux-cluster-jupyterlab.sh script, namely (1) calculation/display of the remaining job elapse time (and job end date), as well as a (2) robust fetching of the JupyterLab server address (based on network location instead of token, see #34 (closed) and !37 (merged) why this was implemented).
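For reference, a minimal sketch of how the remaining run time and the job end date can be queried from SLURM inside the job script; this illustrates the idea and is not necessarily the exact command used in the script:

```bash
# Sketch only: query SLURM for the time left and the expected end time of
# the current job. %L and %e are standard squeue format codes.
TIME_LEFT=$(squeue --noheader --job "${SLURM_JOB_ID}" --format "%L")
END_TIME=$(squeue --noheader --job "${SLURM_JOB_ID}" --format "%e")
echo "JupyterLab job ${SLURM_JOB_ID}: ${TIME_LEFT} remaining (expected end: ${END_TIME})"
```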

This MR is mainly based on a job script that I have used on JUWELS a few times. On JUWELS it was necessary to specify the network location explicitly (the default hostnames do not work!) to be able to connect to the Jupyter server. Since internal network visibility is also one of the first things to check when connecting to a Jupyter server on a compute node fails, and to make debugging more straightforward, I have now implemented a separate section with a VISIBLE_NETWORK_LOCATION environment variable that is then used to start the JupyterLab server in the lower part of the script.
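To illustrate the idea, a hedged sketch: VISIBLE_NETWORK_LOCATION is the variable name from this MR, but the way it is filled below (first address reported by `hostname -I`) is an assumption for illustration and may differ from what the script actually does per system.

```bash
# Sketch only: pick a network location that is visible from the login node
# (assumption: the first address from `hostname -I` is suitable), unless the
# user has already set VISIBLE_NETWORK_LOCATION explicitly.
VISIBLE_NETWORK_LOCATION=${VISIBLE_NETWORK_LOCATION:-$(hostname -I | awk '{print $1}')}

# Bind the JupyterLab server to that address instead of the default hostname.
jupyter lab --no-browser --ip="${VISIBLE_NETWORK_LOCATION}"
```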

Then, while working with the script, especially on the new NESH system, I experienced confusing reporting behaviour during the lifecycle of a JupyterLab job on the compute nodes, in particular once the job had terminated (for whatever reason). From my experience, jobs on the individual HPC systems (NESH, JUWELS, HLRN-G, ...) also tend to linger in different SLURM job states for very different amounts of time, which makes it difficult to come up with common assumptions for a minimalistic solution. To make the job script's user output more resilient against these differences, and to arrive at a unified SLURM job script, I have implemented more fine-grained reporting for the most typical job states (PENDING, CONFIGURING, RUNNING, COMPLETING or COMPLETED), as well as a proper "everything else" job state case, which I could reliably trigger by terminating the JupyterLab server in different ways.
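As a rough illustration of the shape of this state handling (the concrete messages, commands and polling context in the script may differ), something along these lines, where JOBID is the ID of the job being watched:

```bash
# Sketch only: poll the job state via squeue; falling back to sacct for jobs
# that have already left the queue is an assumption for illustration.
JOB_STATE=$(squeue --noheader --job "${JOBID}" --format "%T")
[ -z "${JOB_STATE}" ] && JOB_STATE=$(sacct --noheader --jobs "${JOBID}" --format State | head -n1 | awk '{print $1}')

case "${JOB_STATE}" in
    PENDING)              echo "Job ${JOBID} is waiting in the queue." ;;
    CONFIGURING)          echo "Nodes for job ${JOBID} are being prepared." ;;
    RUNNING)              echo "JupyterLab should be (or become) reachable." ;;
    COMPLETING|COMPLETED) echo "Job ${JOBID} has finished; the JupyterLab server is gone." ;;
    *)                    echo "Job ${JOBID} ended in state '${JOB_STATE}' (cancelled, failed, timed out, ...)." ;;
esac
```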

Finally, I did a bit of housekeeping: I renamed hlrn-goettingen-jupyterlab.sh, since it is now tested to be suitable for at least NESH, HLRN-Göttingen and JUWELS, and deleted the old PBS job script we used on NESH.
