Add ClearML Agent Slurm Glue info (#736)

pollfly 2023-12-24 14:50:56 +02:00 committed by GitHub
parent 793b463f68
commit 036621a729

@@ -327,6 +327,112 @@ There are two options for deploying the ClearML Agent to a Kubernetes cluster:
For more details, see [Kubernetes integration](https://github.com/allegroai/clearml-agent#kubernetes-integration-optional).
### Slurm
:::important Enterprise Feature
Slurm Glue is available under the ClearML Enterprise plan.
:::
Agents can be deployed bare-metal or inside [`Singularity`](https://docs.sylabs.io/guides/3.5/user-guide/introduction.html)
containers in Linux clusters managed with Slurm.
ClearML Agent Slurm Glue maps jobs to Slurm batch scripts: associate a ClearML queue with a batch script template, and
when a task is pushed into the queue, it is converted and executed as an `sbatch` job according to the
template attached to the queue.
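Because the glue submits and queries jobs through the Slurm client tools, it can help to confirm that they are on the `PATH` of the machine before installing. The following is a minimal sketch, assuming a POSIX shell; `missing_commands` is a hypothetical helper name and is not part of ClearML or Slurm:

```shell
#!/bin/sh
# Hypothetical helper (not part of ClearML or Slurm):
# print the name of every given command that is not found on PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Check the Slurm client tools mentioned in this section.
missing=$(missing_commands sbatch squeue)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line
  echo "Not found on PATH:" $missing
else
  echo "sbatch and squeue are available"
fi
```

If either tool is reported as missing, the glue would presumably be unable to submit or track jobs from that machine.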
1. Install the Slurm Glue on a machine where you can run Slurm commands such as `sbatch` and `squeue`.
```
pip3 install -U --extra-index-url https://*****@*****.allegro.ai/repository/clearml_agent_slurm/simple clearml-agent-slurm
```
1. Create a new batch template. Make sure to set the `SBATCH` variables to the resources you want to attach to the queue.
For example, the script below sets up an agent to run bare-metal, creating a virtual environment per job:
```
#!/bin/bash
# available template variables (default value separator ":")
# ${CLEARML_QUEUE_NAME}
# ${CLEARML_QUEUE_ID}
# ${CLEARML_WORKER_ID}
# complex template variables (default value separator ":")
# ${CLEARML_TASK.id}
# ${CLEARML_TASK.name}
# ${CLEARML_TASK.project.id}
# ${CLEARML_TASK.hyperparams.properties.user_key.value}
# example
#SBATCH --job-name=clearml_task_${CLEARML_TASK.id} # Job name DO NOT CHANGE
#SBATCH --ntasks=1 # Run on a single CPU
# #SBATCH --mem=1mb # Job memory request
# #SBATCH --time=00:05:00 # Time limit hrs:min:sec
#SBATCH --output=task-${CLEARML_TASK.id}-%j.log
#SBATCH --partition debug
#SBATCH --cpus-per-task=1
#SBATCH --priority=5
#SBATCH --nodes=${CLEARML_TASK.hyperparams.properties.num_nodes.value:1}
${CLEARML_PRE_SETUP}
echo whoami $(whoami)
${CLEARML_AGENT_EXECUTE}
${CLEARML_POST_SETUP}
```
Note: If you are using Slurm with Singularity container support, replace `${CLEARML_AGENT_EXECUTE}` in the batch
template with `singularity exec ${CLEARML_AGENT_EXECUTE}`. For additional required settings, see [Slurm with Singularity](#slurm-with-singularity).
:::tip
You can override the default values of a Slurm job template via the ClearML Web UI. The following line in the
template sets the `nodes` value to the ClearML task's `num_nodes` user property, with `1` as the default value:
```
#SBATCH --nodes=${CLEARML_TASK.hyperparams.properties.num_nodes.value:1}
```
This user property can be modified in the UI, in the task's **CONFIGURATION > User Properties** section. When the
task is executed, the modified value is used.
:::
3. Launch the ClearML Agent Slurm Glue and assign the Slurm configuration to a ClearML queue. For example, the following
associates the `default` queue with the `slurm.example.template` script, so any jobs pushed to this queue use the
resources set by that script.
```
clearml-agent-slurm --template-files slurm.example.template --queue default
```
You can also pass multiple templates and queues. For example:
```
clearml-agent-slurm --template-files slurm.template1 slurm.template2 --queue queue1 queue2
```
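The `${NAME:default}` placeholders in the batch template can be read as "use the task's value if there is one, otherwise the text after the `:`". This is template syntax, not shell parameter expansion (in bash, `${var:1}` is a substring expansion). The sketch below only illustrates that reading in a POSIX shell. It is not the Slurm Glue implementation, `resolve_placeholder` is a hypothetical name, and treating an empty value as "not set" is an assumption:

```shell
#!/bin/sh
# Illustration only: NOT the Slurm Glue implementation.
# resolve_placeholder BODY [VALUE]
#   BODY  - text between "${" and "}", e.g. "CLEARML_QUEUE_NAME" or "some.key:1"
#   VALUE - value found for the variable; omitted or empty means "not set" (assumption)
resolve_placeholder() {
  body="$1"
  value="${2:-}"
  case "$body" in
    *:*) default="${body#*:}" ;;  # text after the first ":" is the default
    *)   default="" ;;            # no separator, so no default
  esac
  if [ -n "$value" ]; then
    printf '%s\n' "$value"
  else
    printf '%s\n' "$default"
  fi
}

# num_nodes not set on the task: the default after ":" is used
resolve_placeholder "CLEARML_TASK.hyperparams.properties.num_nodes.value:1"    # prints 1
# num_nodes set to 4 in the task's user properties: the task value is used
resolve_placeholder "CLEARML_TASK.hyperparams.properties.num_nodes.value:1" 4  # prints 4
```

Under this reading, `#SBATCH --nodes=${CLEARML_TASK.hyperparams.properties.num_nodes.value:1}` requests one node unless the task carries a `num_nodes` user property.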
#### Slurm with Singularity
If you are running Slurm with Singularity container support, set the following:
1. Make sure your `sbatch` template contains:
```
singularity exec ${CLEARML_AGENT_EXECUTE}
```
Additional Singularity arguments can be added, for example:
```
singularity exec --uts ${CLEARML_AGENT_EXECUTE}
```
1. Set the default Singularity container to use in your [clearml.conf](configs/clearml_conf.md) file:
```
agent.default_docker.image="shub://repo/hello-world"
```
Or
```
agent.default_docker.image="docker://ubuntu"
```
1. Add `--singularity-mode` to the command line, for example:
```
clearml-agent-slurm --singularity-mode --template-files slurm.example_singularity.template --queue default
```
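If a bare-metal template already exists, a Singularity variant can be derived by rewriting the line that holds `${CLEARML_AGENT_EXECUTE}`. The following is a minimal sketch with `sed`, assuming the placeholder sits alone on its own line as in the example template in the Slurm section; `to_singularity_template` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper (not part of ClearML):
# to_singularity_template SRC DST
#   Copy SRC to DST, prefixing the ${CLEARML_AGENT_EXECUTE} line with "singularity exec".
#   Assumes the placeholder is the only text on its line; other lines are copied unchanged.
to_singularity_template() {
  sed 's|^\${CLEARML_AGENT_EXECUTE}$|singularity exec ${CLEARML_AGENT_EXECUTE}|' "$1" > "$2"
}

# Example usage (file names follow the examples in this section):
#   to_singularity_template slurm.example.template slurm.example_singularity.template
```

Extra Singularity arguments, such as `--uts`, would still need to be added to the generated line by hand.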
### Explicit Task Execution
ClearML Agent can also execute specific tasks directly, without listening to a queue.