Tutorial

You can use enjoy-slurm to submit and manage Slurm jobs from Python.

NOTE: This tutorial was run on Levante at DKRZ. To run it somewhere else, you will have to adapt the partition names and, of course, the account.

Let’s assume you have a shell script test.sh:

#!/bin/sh
echo "Hello World from $(hostname)"
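
If you would rather create the script from Python as well, a minimal sketch could look like the following. The helper name `write_script` and the use of a temporary directory are assumptions made for illustration; neither is part of enjoy-slurm.

```python
import tempfile
from pathlib import Path

# Content of the example job script from this tutorial
SCRIPT = '#!/bin/sh\necho "Hello World from $(hostname)"\n'


def write_script(directory, name="test.sh"):
    """Write the example job script into `directory` and return its path.

    Hypothetical helper for illustration only.
    """
    path = Path(directory) / name
    path.write_text(SCRIPT)
    return path


# A temporary directory keeps the example from cluttering the working directory
script_path = write_script(tempfile.mkdtemp())
print(script_path.read_text())
```

The rest of the tutorial assumes that test.sh lives in the current working directory, so in practice you would probably write it there instead.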

You can submit this script using sbatch. Right after submission, we retrieve some job information using scontrol. Note that scontrol.show usually only works while the job has not yet completed.

[1]:
import enjoy_slurm as slurm

jobid = slurm.sbatch("test.sh", account="ch0636", partition="shared")
jobinfo = slurm.scontrol.show(jobid=jobid)
jobid
[1]:
4248273
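
Since the scontrol information may no longer be available once the job has finished, one option is to store the dictionary on disk right after submission. The sketch below assumes that the job info is a plain dictionary of JSON-serializable values; the helper names `save_jobinfo` and `load_jobinfo` and the example values are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_jobinfo(jobinfo, path):
    """Store a job info dictionary as JSON (hypothetical helper)."""
    path = Path(path)
    path.write_text(json.dumps(jobinfo, indent=2, sort_keys=True))
    return path


def load_jobinfo(path):
    """Read a job info dictionary back from a JSON file."""
    return json.loads(Path(path).read_text())


# Illustrative stand-in for the result of slurm.scontrol.show(jobid=jobid)
example = {
    "4248273": {"JobName": "test.sh", "JobState": "PENDING", "Partition": "shared"}
}

snapshot = save_jobinfo(example, Path(tempfile.mkdtemp()) / "jobinfo.json")
restored = load_jobinfo(snapshot)
```

With such a snapshot, fields like the log file path remain accessible even after the scheduler has dropped the job from its active records.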

Now you can check the state of your job using sacct:

[2]:
slurm.sacct(jobid)
[2]:
JobID Elapsed NCPUS NTasks State End JobName
0 4248273 00:00:00 1 NaN PENDING Unknown test.sh

Let’s have a look at the job info that we retrieved while the job was still pending:

[3]:
jobinfo[str(jobid)].keys()
[3]:
dict_keys(['JobId', 'JobName', 'UserId', 'GroupId', 'MCS_label', 'Priority', 'Nice', 'Account', 'QOS', 'JobState', 'Reason', 'Dependency', 'Requeue', 'Restarts', 'BatchFlag', 'Reboot', 'ExitCode', 'RunTime', 'TimeLimit', 'TimeMin', 'SubmitTime', 'EligibleTime', 'AccrueTime', 'StartTime', 'EndTime', 'Deadline', 'SuspendTime', 'SecsPreSuspend', 'LastSchedEval', 'Partition', 'AllocNode:Sid', 'ReqNodeList', 'ExcNodeList', 'NodeList', 'NumNodes', 'NumCPUs', 'NumTasks', 'CPUs/Task', 'ReqB:S:C:T', 'TRES', 'Socks/Node', 'NtasksPerN:B:S:C', 'CoreSpec', 'MinCPUsNode', 'MinMemoryCPU', 'MinTmpDiskNode', 'Features', 'DelayBoot', 'OverSubscribe', 'Contiguous', 'Licenses', 'Network', 'Command', 'WorkDir', 'StdErr', 'StdIn', 'StdOut', 'Power'])
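
The full dictionary is rather long, so it can help to pick out only a few fields. The following is a small sketch; the helper `pick_fields` and the example values are made up for illustration, while the key names are taken from the output above.

```python
def pick_fields(info, fields=("JobState", "Partition", "NumCPUs", "StdOut")):
    """Return only the selected fields; missing keys map to None."""
    return {key: info.get(key) for key in fields}


# Illustrative stand-in for jobinfo[str(jobid)]; the values are hypothetical
info = {
    "JobId": "4248273",
    "JobName": "test.sh",
    "JobState": "PENDING",
    "Partition": "shared",
    "NumCPUs": "1",
}

summary = pick_fields(info)
print(summary)
```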

By now, the job should have completed:

[4]:
slurm.sacct(jobid)
[4]:
JobID Elapsed NCPUS NTasks State End JobName
0 4248273 00:00:15 2 NaN COMPLETED 2023-03-15T10:19:10 test.sh
1 4248273.batch 00:00:15 2 1.0 COMPLETED 2023-03-15T10:19:10 batch
2 4248273.extern 00:00:15 2 1.0 COMPLETED 2023-03-15T10:19:10 extern
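
Instead of guessing when the job has finished, you could poll its state until it reaches a final one. The sketch below is independent of enjoy-slurm: `wait_for_job` is a hypothetical helper that takes any callable returning the current state as a string, and the set of final states is a non-exhaustive assumption.

```python
import time

# Assumed (non-exhaustive) set of Slurm states in which a job no longer changes
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


def wait_for_job(get_state, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call `get_state` repeatedly until it reports a terminal state.

    Raises TimeoutError if no terminal state is seen within `timeout` seconds
    of accumulated waiting.
    """
    waited = 0.0
    while True:
        state = get_state()
        # sacct may append details (e.g. "CANCELLED by <uid>"), so compare
        # only the first word of the reported state
        first_word = state.split()[0] if state.strip() else ""
        if first_word in TERMINAL_STATES:
            return first_word
        if waited >= timeout:
            raise TimeoutError(f"job still in state {state!r} after {waited} s")
        sleep(interval)
        waited += interval


# Demonstration with a fake state sequence instead of a real scheduler
fake_states = iter(["PENDING", "RUNNING", "COMPLETED"])
final_state = wait_for_job(lambda: next(fake_states), interval=0, sleep=lambda s: None)
print(final_state)

# With enjoy-slurm, the callable might look like this, ASSUMING that
# slurm.sacct returns a pandas DataFrame with a "State" column:
# wait_for_job(lambda: slurm.sacct(jobid)["State"].iloc[0])
```

Passing the state lookup as a callable keeps the waiting logic testable without access to a cluster.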

Let’s check the content of the log file:

[5]:
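def get_log(logfile):
    """Return the first line of the given log file."""
    with open(logfile) as f:
        return f.read().splitlines()[0]


# scontrol reports the path of the job's standard output file under "StdOut"
logfile = jobinfo[str(jobid)].get("StdOut")
get_log(logfile)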
[5]:
'Hello World from l40000.lvt.dkrz.de'
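
Note that `get_log` fails if the log file does not exist yet or is still empty, which can happen while a job is pending. A more forgiving variant could look like this sketch; the name `first_line` and the example file are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def first_line(logfile, default=None):
    """Return the first line of `logfile`, or `default` if there is none."""
    path = Path(logfile)
    if not path.is_file():
        # The job may not have started writing its output yet
        return default
    with path.open() as f:
        line = f.readline()
    return line.rstrip("\n") if line else default


# Demonstration with a temporary file instead of a real Slurm log
tmpdir = Path(tempfile.mkdtemp())
existing = tmpdir / "slurm-example.out"
existing.write_text("Hello World from examplehost\n")

present = first_line(existing)
absent = first_line(tmpdir / "missing.out", default="<no log yet>")
```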

enjoy-slurm becomes more useful when you want to manage many jobs, which is easy to do in Python, e.g.

[13]:
jobinfo = {}

for _ in range(10):
    jobid = slurm.sbatch("test.sh", account="ch0636", partition="shared")
    # keep only the inner dictionary, keyed by the integer job ID
    jobinfo[jobid] = slurm.scontrol.show(jobid=jobid)[str(jobid)]

Check the accounting:

[14]:
slurm.sacct(name="test.sh", state="PENDING")
[14]:
JobID JobName Partition Account AllocCPUS State ExitCode
0 4248312 test.sh shared ch0636 1 PENDING 0:0
1 4248313 test.sh shared ch0636 1 PENDING 0:0
2 4248314 test.sh shared ch0636 1 PENDING 0:0
3 4248315 test.sh shared ch0636 1 PENDING 0:0
4 4248316 test.sh shared ch0636 1 PENDING 0:0
5 4248317 test.sh shared ch0636 1 PENDING 0:0
6 4248318 test.sh shared ch0636 1 PENDING 0:0
7 4248319 test.sh shared ch0636 1 PENDING 0:0
8 4248320 test.sh shared ch0636 1 PENDING 0:0
9 4248321 test.sh shared ch0636 1 PENDING 0:0
[15]:
jobinfo.keys()
[15]:
dict_keys([4248312, 4248313, 4248314, 4248315, 4248316, 4248317, 4248318, 4248319, 4248320, 4248321])

And finally, let’s print the log contents:

[16]:
logs = {}

for jobid, info in jobinfo.items():
    logs[jobid] = get_log(info.get("StdOut"))
[17]:
logs
[17]:
{4248312: 'Hello World from l40000.lvt.dkrz.de',
 4248313: 'Hello World from l40000.lvt.dkrz.de',
 4248314: 'Hello World from l40000.lvt.dkrz.de',
 4248315: 'Hello World from l40000.lvt.dkrz.de',
 4248316: 'Hello World from l40000.lvt.dkrz.de',
 4248317: 'Hello World from l40000.lvt.dkrz.de',
 4248318: 'Hello World from l40000.lvt.dkrz.de',
 4248319: 'Hello World from l40000.lvt.dkrz.de',
 4248320: 'Hello World from l40000.lvt.dkrz.de',
 4248321: 'Hello World from l40000.lvt.dkrz.de'}
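
With many jobs, it is often more convenient to aggregate the logs than to read them one by one. As a small sketch, the following counts how many jobs ran on each host; the helper `hosts_from_logs` is hypothetical and relies on the message format printed by test.sh.

```python
from collections import Counter

# Prefix of the message printed by test.sh
PREFIX = "Hello World from "


def hosts_from_logs(logs):
    """Count how many jobs reported each hostname in their first log line."""
    return Counter(
        line[len(PREFIX):] for line in logs.values() if line.startswith(PREFIX)
    )


# Two entries taken from the output above
logs = {
    4248312: "Hello World from l40000.lvt.dkrz.de",
    4248313: "Hello World from l40000.lvt.dkrz.de",
}

host_counts = hosts_from_logs(logs)
print(host_counts)
```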