If you’ve used a supercomputer, you’ve probably dealt with a queueing system like Slurm. It’s very convenient: you just submit jobs and let the scheduler take care of the rest. I decided to install it on my personal workstation as well.
That said, I’m not running a cluster — just a single workstation (PC). So I skipped authentication (like munge) and went for the bare minimum setup. My lab network is isolated from the outside world, and no one else uses this machine, so I’m ignoring security concerns. If you’re following this setup, proceed with caution.
Install via apt
We can install it using apt.
$ sudo apt install slurm-wlm

munge will also be installed by default, but we won’t be using it.
Create slurm.conf
$ sudo vim /etc/slurm/slurm.conf

Here’s a minimal configuration.
ClusterName=local
ControlMachine=hostname
NodeName=hostname
PartitionName=main Nodes=hostname Default=YES MaxTime=INFINITE State=UP
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/none
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
TaskPlugin=task/none

The hostname should match the value shown by hostname -s.
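For example, on a machine whose short hostname happens to be mybox (a made-up name for illustration), hostname -s prints mybox, and that exact string replaces hostname above:

$ hostname -s
mybox

ControlMachine=mybox
NodeName=mybox
PartitionName=main Nodes=mybox Default=YES MaxTime=INFINITE State=UP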
Technically, NodeName can include properties like CPUs, but since I’m not dividing resources, I left them out. Running slurmd -C will print the detected hardware, so Slurm may pick up the specs automatically. If you need resource partitioning, you may want to set those values explicitly.
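If you do want explicit limits, one approach (a sketch; the numbers below are made-up examples, so use whatever your own machine reports) is to copy the line printed by slurmd -C into slurm.conf:

$ slurmd -C
NodeName=mybox CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000

and then replace the bare NodeName line with something like:

NodeName=mybox CPUs=16 RealMemory=64000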
The key setting here is AuthType=auth/none, which turns authentication off.
Create necessary directories and set permissions
$ sudo mkdir -p /var/spool/slurm
$ sudo mkdir -p /var/spool/slurmd
$ sudo chown -R slurm: /var/spool/slurm /var/spool/slurmd

Disable munge
Since we’re using auth/none, munge isn’t required. It doesn’t hurt to leave it running, but I disabled it just in case.
$ sudo systemctl disable --now munge

Start Slurm
$ sudo systemctl enable --now slurmctld
$ sudo systemctl enable --now slurmd

Check that it’s running correctly.
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      1   idle localhost
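If the node doesn’t come up as idle, a quick way to investigate (just a troubleshooting aside, not part of the setup itself) is to look at the node record and the daemon logs:

$ scontrol show node hostname      # check the State= and Reason= fields
$ journalctl -u slurmctld -u slurmd --since "10 min ago"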
Submit a job

Try submitting a test job.
$ set +H
$ echo -e "#!/bin/bash\necho Hello, Slurm!" > test.sh
$ chmod +x test.sh
$ sbatch test.sh
$ squeue
$ cat slurm-*.out
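For anything beyond this smoke test, it’s usually nicer to put #SBATCH directives in the script itself. A minimal sketch (the job name and resource numbers are arbitrary examples, and the CPU request only matters if CPUs are configured in slurm.conf):

#!/bin/bash
#SBATCH --job-name=example       # name shown in squeue
#SBATCH --output=slurm-%j.out    # %j expands to the job ID
#SBATCH --cpus-per-task=4        # honored only if CPUs are defined for the node
#SBATCH --time=01:00:00          # wall-clock limit (hh:mm:ss)

echo "Running on $(hostname) with ${SLURM_CPUS_PER_TASK:-1} CPUs"

Submit it the same way with sbatch.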
Fixing STATE=DOWN after reboot

Sometimes, after a reboot, the node STATE appears as DOWN. You can reset it with the command below, although the cause remains unclear.
$ sudo scontrol update nodename=hostname state=idle
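Before resetting, it can be informative to see what reason Slurm recorded for marking the node down (purely diagnostic, not required for the fix):

$ sinfo -R                                       # lists down/drained nodes with their Reason
$ scontrol show node hostname | grep -iE "state|reason"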
Prioritize a job (job preemption)

If you’ve submitted many jobs and want to prioritize a new one urgently, you can do the following:
Submit the job as usual.
$ sbatch job.sh

Adjust its priority.
$ sudo scontrol update jobid=<jobid> Nice=-10

By default, Nice is set to 0, and lower (negative) values are scheduled first. You’ll need sudo to set a negative value.
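To confirm the change took effect, the job’s Priority and Nice values can be read back from scontrol (a quick check, not part of the original steps):

$ scontrol show job <jobid> | grep -i priority   # the matching line also shows Nice=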
Using sacct
If we want to view job history using sacct, we’ll need the following setup.
$ sudo apt install slurmdbd mysql-server-8.0
$ sudo service mysql start
$ sudo mysql -u root
(mysql) CREATE DATABASE slurm_acct_db;
(mysql) CREATE USER 'slurm'@'localhost' IDENTIFIED BY '<password>';
(mysql) GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
(mysql) FLUSH PRIVILEGES;

Add this to /etc/slurm/slurmdbd.conf.
AuthType=auth/none
DbdHost=localhost
DbdPort=6819
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePass=<password>
StorageUser=slurm
StorageLoc=slurm_acct_db
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
SlurmUser=slurm

Change the file’s ownership and permissions accordingly.
$ sudo chown slurm: /etc/slurm/slurmdbd.conf
$ sudo chmod 600 /etc/slurm/slurmdbd.conf

Then, add the following lines to /etc/slurm/slurm.conf.
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=<hostname where slurmdbd runs>

Start slurmdbd, and restart slurmctld and slurmd.
$ sudo systemctl start slurmdbd
$ sudo systemctl restart slurmctld slurmd
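Before querying, it’s worth making sure slurmdbd actually started and connected to MySQL; the log path is the one set in slurmdbd.conf above (a sanity check, not part of the original steps):

$ systemctl status slurmdbd
$ sudo tail /var/log/slurmdbd.log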
Try running sacct.

$ sacct -o User,JobID,Partition,NNodes,Submit,Start,End,Elapsed,State -X
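Note that sacct only reports jobs since midnight by default; to look further back, pass a start time (and optionally an end time), for example:

$ sacct -S 2024-01-01 -E now -X -o JobID,JobName,Partition,State,Elapsed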
Stop Slurm after current jobs finish (e.g., for maintenance)

We can keep new jobs from starting while letting the ones already running finish.
$ sudo scontrol update NodeName=<node name> State=DRAIN Reason="Maintenance after current job"

To revert this behavior, run the following.
$ sudo scontrol update NodeName=<node name> State=RESUME
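To confirm the transition, sinfo is enough: while a job is still running the node should show as drng (draining), then drain once it finishes, and idle again after RESUME (state abbreviations may vary slightly between Slurm versions):

$ sinfo           # watch the STATE column: drng -> drain -> idle
$ sinfo -R        # shows the Reason recorded for drained/down nodes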