Restructure ClearML Agent pages (#873)

commit f6781628e0 (parent b31452f6a1)
@@ -24,7 +24,7 @@ VS Code remote sessions use ports 8878 and 8898 respectively.

 ## Prerequisites

 * `clearml` installed and configured. See [Getting Started](../getting_started/ds/ds_first_steps.md) for details.
-* At least one `clearml-agent` running on a remote host. See [installation](../clearml_agent.md#installation) for details.
+* At least one `clearml-agent` running on a remote host. See [installation](../clearml_agent/clearml_agent_setup.md#installation) for details.
 * An SSH client installed on your machine. To verify, open your terminal and execute `ssh`. If you did not receive an
   error, you are good to go.

@@ -142,7 +142,7 @@ sessions:
     maxServices: 20
 ```

-For more information, see [Kubernetes](../clearml_agent.md#kubernetes).
+For more information, see [Kubernetes](../clearml_agent/clearml_agent_deployment.md#kubernetes).


 ### Installing Requirements
File diff suppressed because it is too large.

docs/clearml_agent/clearml_agent_deployment.md (new file, 271 lines)
@@ -0,0 +1,271 @@
|
||||
---
|
||||
title: Deployment
|
||||
---
|
||||
|
||||
## Spinning Up an Agent
|
||||
You can spin up an agent on any machine: on-prem and/or cloud instance. When spinning up an agent, you assign it to
service one or more queues. Utilize the machine by enqueuing tasks to a queue that the agent is servicing; the agent
will pull and execute them.
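For example, assuming you have a local training script, you can create a task from it and enqueue it with the
`clearml-task` CLI (the project, task, and queue names below are placeholders):

```bash
# create a task from a local script and enqueue it; an agent servicing
# the "default" queue will pull the task and execute it on its machine
clearml-task --project examples --name remote-training --script train.py --queue default
```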
|
||||
|
||||
:::tip cross-platform execution
|
||||
ClearML Agent is platform agnostic. When using the ClearML Agent to execute experiments cross-platform, set platform-specific
environment variables before launching the agent.
|
||||
|
||||
For example, to run an agent on an ARM device, set the core type environment variable before spinning up the agent:
|
||||
|
||||
```bash
|
||||
export OPENBLAS_CORETYPE=ARMV8
|
||||
clearml-agent daemon --queue <queue_name>
|
||||
```
|
||||
:::
|
||||
|
||||
### Executing an Agent
|
||||
To execute an agent listening to a queue, run:
|
||||
|
||||
```bash
|
||||
clearml-agent daemon --queue <queue_name>
|
||||
```
|
||||
|
||||
### Executing in Background
|
||||
To execute an agent in the background, run:
|
||||
```bash
|
||||
clearml-agent daemon --queue <execution_queue_to_pull_from> --detached
|
||||
```
|
||||
### Stopping Agents
|
||||
To stop an agent running in the background, run:
|
||||
```bash
|
||||
clearml-agent daemon <arguments> --stop
|
||||
```
|
||||
|
||||
### Allocating Resources
|
||||
To specify GPUs associated with the agent, add the `--gpus` flag.
To execute multiple agents on the same machine (usually assigning a specific GPU to each agent), run:
|
||||
```bash
|
||||
clearml-agent daemon --detached --queue default --gpus 0
|
||||
clearml-agent daemon --detached --queue default --gpus 1
|
||||
```
|
||||
To allocate more than one GPU, provide a list of allocated GPUs:
|
||||
```bash
|
||||
clearml-agent daemon --gpus 0,1 --queue dual_gpu
|
||||
```
|
||||
|
||||
### Queue Prioritization
|
||||
A single agent can listen to multiple queues. Queue priority is determined by the order in which the queues are specified.
|
||||
|
||||
```bash
|
||||
clearml-agent daemon --detached --queue high_q low_q --gpus 0
|
||||
```
|
||||
This ensures the agent first tries to pull a Task from the `high_q` queue, and only if that queue is empty will it try to pull
from the `low_q` queue.
|
||||
|
||||
To make sure an agent pulls from all queues equally, add the `--order-fairness` flag.
|
||||
```bash
|
||||
clearml-agent daemon --detached --queue group_a group_b --order-fairness --gpus 0
|
||||
```
|
||||
This makes the agent pull from the `group_a` queue, then from `group_b`, then back to `group_a`, and so on, ensuring
that neither `group_a` nor `group_b` can starve the other of resources.
|
||||
|
||||
### SSH Access
|
||||
By default, ClearML Agent maps the host's `~/.ssh` into the container's `/root/.ssh` directory (configurable,
|
||||
see [clearml.conf](../configs/clearml_conf.md#docker_internal_mounts)).
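If needed, the mapping can be changed in the `agent` section of `clearml.conf`; a minimal sketch (the key name and path
shown reflect typical defaults and may differ in your configuration file) might look like:

```
agent {
    docker_internal_mounts {
        # path inside the container where the host's ~/.ssh is mounted
        ssh_folder: "/root/.ssh"
    }
}
```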
|
||||
|
||||
If you want to use existing auth sockets with ssh-agent, you can verify your host ssh-agent is working correctly with:
|
||||
|
||||
```commandline
|
||||
echo $SSH_AUTH_SOCK
|
||||
```
|
||||
|
||||
You should see a path to a temporary file, something like this:
|
||||
|
||||
```console
|
||||
/tmp/ssh-<random>/agent.<random>
|
||||
```
|
||||
|
||||
Then run your `clearml-agent` in Docker mode, which will automatically detect the `SSH_AUTH_SOCK` environment variable,
|
||||
and mount the socket into any container it spins.
|
||||
|
||||
You can also explicitly set the `SSH_AUTH_SOCK` environment variable when executing an agent. The command below will
|
||||
execute an agent in Docker mode and assign it to service a queue. The agent will have access to
|
||||
the SSH socket provided in the environment variable.
|
||||
|
||||
```
|
||||
SSH_AUTH_SOCK=<file_socket> clearml-agent daemon --gpus <your config> --queue <your queue name> --docker
|
||||
```
|
||||
|
||||
## Kubernetes
|
||||
Agents can be deployed bare-metal or as Docker containers in a Kubernetes cluster. ClearML Agent adds the missing scheduling
|
||||
capabilities to Kubernetes, allows for more flexible automation from code, and gives access to all of ClearML Agent's
|
||||
features.
|
||||
|
||||
ClearML Agent is deployed onto a Kubernetes cluster through its Kubernetes-Glue which maps ClearML jobs directly to K8s
|
||||
jobs:
|
||||
* Use the [ClearML Agent Helm Chart](https://github.com/allegroai/clearml-helm-charts/tree/main/charts/clearml-agent) to
  spin up an agent pod acting as a controller (see the installation sketch after this list). Alternatively (less recommended), run a [k8s glue script](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py)
  on a K8s CPU node
|
||||
* The ClearML K8S glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on provided yaml
|
||||
template)
|
||||
* Inside each job pod the `clearml-agent` will install the ClearML task's environment and run and monitor the experiment's
|
||||
process
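For reference, installing the Helm chart typically looks like the following sketch (the repository URL, namespace, and
value names are illustrative; consult the chart's README for the exact settings):

```bash
# add the ClearML Helm repository and install the agent chart with placeholder credentials
helm repo add allegroai https://allegroai.github.io/clearml-helm-charts
helm repo update
helm install clearml-agent allegroai/clearml-agent \
  --namespace clearml --create-namespace \
  --set clearml.agentk8sglueKey=<access_key> \
  --set clearml.agentk8sglueSecret=<secret_key>
```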
|
||||
|
||||
:::important Enterprise Feature
|
||||
The ClearML Enterprise plan supports K8S servicing multiple ClearML queues, as well as providing a pod template for each
|
||||
queue for describing the resources for each pod to use.
|
||||
|
||||
For example, the following configures which resources to use for `example_queue_1` and `example_queue_2`:
|
||||
|
||||
```yaml
|
||||
agentk8sglue:
|
||||
queues:
|
||||
example_queue_1:
|
||||
templateOverrides:
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 1
|
||||
nodeSelector:
|
||||
nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
|
||||
example_queue_2:
|
||||
templateOverrides:
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 2
|
||||
nodeSelector:
|
||||
nvidia.com/gpu.product: A100-SXM4-40GB
|
||||
```
|
||||
:::
|
||||
|
||||
## Slurm
|
||||
|
||||
:::important Enterprise Feature
|
||||
Slurm Glue is available under the ClearML Enterprise plan
|
||||
:::
|
||||
|
||||
Agents can be deployed bare-metal or inside [`Singularity`](https://docs.sylabs.io/guides/3.5/user-guide/introduction.html)
|
||||
containers in linux clusters managed with Slurm.
|
||||
|
||||
ClearML Agent Slurm Glue maps jobs to Slurm batch scripts: associate a ClearML queue to a batch script template, then
|
||||
when a Task is pushed into the queue, it will be converted and executed as an `sbatch` job according to the sbatch
|
||||
template specification attached to the queue.
|
||||
|
||||
1. Install the Slurm Glue on a machine where you can run `sbatch` / `squeue` etc.
|
||||
|
||||
```
|
||||
pip3 install -U --extra-index-url https://*****@*****.allegro.ai/repository/clearml_agent_slurm/simple clearml-agent-slurm
|
||||
```
|
||||
|
||||
1. Create a batch template. Make sure to set the `SBATCH` variables to the resources you want to attach to the queue.
|
||||
The script below sets up an agent to run bare-metal, creating a virtual environment per job. For example:
|
||||
|
||||
```
|
||||
#!/bin/bash
|
||||
# available template variables (default value separator ":")
|
||||
# ${CLEARML_QUEUE_NAME}
|
||||
# ${CLEARML_QUEUE_ID}
|
||||
# ${CLEARML_WORKER_ID}.
|
||||
# complex template variables (default value separator ":")
|
||||
# ${CLEARML_TASK.id}
|
||||
# ${CLEARML_TASK.name}
|
||||
# ${CLEARML_TASK.project.id}
|
||||
# ${CLEARML_TASK.hyperparams.properties.user_key.value}
|
||||
|
||||
|
||||
# example
|
||||
#SBATCH --job-name=clearml_task_${CLEARML_TASK.id} # Job name DO NOT CHANGE
|
||||
#SBATCH --ntasks=1 # Run on a single CPU
|
||||
# #SBATCH --mem=1mb # Job memory request
|
||||
# #SBATCH --time=00:05:00 # Time limit hrs:min:sec
|
||||
#SBATCH --output=task-${CLEARML_TASK.id}-%j.log
|
||||
#SBATCH --partition debug
|
||||
#SBATCH --cpus-per-task=1
|
||||
#SBATCH --priority=5
|
||||
#SBATCH --nodes=${CLEARML_TASK.hyperparams.properties.num_nodes.value:1}
|
||||
|
||||
|
||||
${CLEARML_PRE_SETUP}
|
||||
|
||||
echo whoami $(whoami)
|
||||
|
||||
${CLEARML_AGENT_EXECUTE}
|
||||
|
||||
${CLEARML_POST_SETUP}
|
||||
```
|
||||
|
||||
Notice: If you are using Slurm with Singularity container support, replace `${CLEARML_AGENT_EXECUTE}` in the batch
template with `singularity exec ${CLEARML_AGENT_EXECUTE}`. For additional required settings, see [Slurm with Singularity](#slurm-with-singularity).
|
||||
|
||||
:::tip
|
||||
You can override the default values of a Slurm job template via the ClearML Web UI. The following command in the
|
||||
template sets the `nodes` value to be the ClearML Task’s `num_nodes` user property:
|
||||
```
|
||||
#SBATCH --nodes=${CLEARML_TASK.hyperparams.properties.num_nodes.value:1}
|
||||
```
|
||||
This user property can be modified in the UI, in the task's **CONFIGURATION > User Properties** section, and when the
|
||||
task is executed the new modified value will be used.
|
||||
:::
|
||||
|
||||
3. Launch the ClearML Agent Slurm Glue and assign the Slurm configuration to a ClearML queue. For example, the following
|
||||
associates the `default` queue to the `slurm.example.template` script, so any jobs pushed to this queue will use the
|
||||
resources set by that script.
|
||||
```
|
||||
clearml-agent-slurm --template-files slurm.example.template --queue default
|
||||
```
|
||||
|
||||
You can also pass multiple templates and queues. For example:
|
||||
```
|
||||
clearml-agent-slurm --template-files slurm.template1 slurm.template2 --queue queue1 queue2
|
||||
```
|
||||
|
||||
### Slurm with Singularity
|
||||
If you are running Slurm with Singularity container support, set the following:
|
||||
|
||||
1. Make sure your `sbatch` template contains:
|
||||
```
|
||||
singularity exec ${CLEARML_AGENT_EXECUTE}
|
||||
```
|
||||
Additional singularity arguments can be added, for example:
|
||||
```
|
||||
singularity exec --uts ${CLEARML_AGENT_EXECUTE}
|
||||
```
|
||||
1. Set the default Singularity container to use in your [clearml.conf](../configs/clearml_conf.md) file:
|
||||
```
|
||||
agent.default_docker.image="shub://repo/hello-world"
|
||||
```
|
||||
Or
|
||||
```
|
||||
agent.default_docker.image="docker://ubuntu"
|
||||
```
|
||||
|
||||
1. Add `--singularity-mode` to the command line, for example:
|
||||
```
|
||||
clearml-agent-slurm --singularity-mode --template-files slurm.example_singularity.template --queue default
|
||||
```
|
||||
|
||||
## Explicit Task Execution
|
||||
|
||||
ClearML Agent can also execute specific tasks directly, without listening to a queue.
|
||||
|
||||
### Execute a Task without Queue
|
||||
|
||||
Execute a Task with a `clearml-agent` worker without a queue.
|
||||
```bash
|
||||
clearml-agent execute --id <task-id>
|
||||
```
|
||||
### Clone a Task and Execute the Cloned Task
|
||||
|
||||
Clone the specified Task and execute the cloned Task with a `clearml-agent` worker without a queue.
|
||||
```bash
|
||||
clearml-agent execute --id <task-id> --clone
|
||||
```
|
||||
|
||||
### Execute Task inside a Docker
|
||||
|
||||
Execute a Task with a `clearml-agent` worker using a Docker container without a queue.
|
||||
```bash
|
||||
clearml-agent execute --id <task-id> --docker
|
||||
```
|
||||
|
||||
## Debugging
|
||||
|
||||
Run a `clearml-agent` daemon in foreground mode, sending all output to the console.
|
||||
```bash
|
||||
clearml-agent daemon --queue default --foreground
|
||||
```
|
docs/clearml_agent/clearml_agent_docker.md (new file, 48 lines)
@@ -0,0 +1,48 @@
|
||||
---
|
||||
title: Building Docker Containers
|
||||
---
|
||||
|
||||
## Exporting a Task into a Standalone Docker Container
|
||||
|
||||
### Task Container
|
||||
|
||||
Build a Docker container that, when launched, executes a specific experiment or a clone (copy) of that experiment.
|
||||
|
||||
- Build a Docker container that at launch will execute a specific Task:
|
||||
|
||||
```bash
|
||||
clearml-agent build --id <task-id> --docker --target <new-docker-name> --entry-point reuse_task
|
||||
```
|
||||
|
||||
- Build a Docker container that at launch will clone a Task specified by Task ID, and will execute the newly cloned Task:
|
||||
|
||||
```bash
|
||||
clearml-agent build --id <task-id> --docker --target <new-docker-name> --entry-point clone_task
|
||||
```
|
||||
|
||||
- Run the built Docker container by executing:
|
||||
|
||||
```bash
|
||||
docker run <new-docker-name>
|
||||
```
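If the experiment needs GPU access, the standalone container can be run with Docker's GPU flags (assuming the NVIDIA
container toolkit is installed on the host):

```bash
docker run --gpus all <new-docker-name>
```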
|
||||
|
||||
Check out [this tutorial](../guides/clearml_agent/executable_exp_containers.md) for building executable experiment
|
||||
containers.
|
||||
|
||||
### Base Docker Container
|
||||
|
||||
Build a Docker container according to the execution environment of a specific task.
|
||||
|
||||
```bash
|
||||
clearml-agent build --id <task-id> --docker --target <new-docker-name>
|
||||
```
|
||||
|
||||
You can add the Docker container as the base Docker image to a task (experiment), using one of the following methods:
|
||||
|
||||
- Using the **ClearML Web UI** - See [Base Docker image](../webapp/webapp_exp_tuning.md#base-docker-image) on the "Tuning
|
||||
Experiments" page.
|
||||
- In the ClearML configuration file - Use the ClearML configuration file [`agent.default_docker`](../configs/clearml_conf.md#agentdefault_docker)
|
||||
options.
|
||||
|
||||
Check out [this tutorial](../guides/clearml_agent/exp_environment_containers.md) for building a Docker container
|
||||
replicating the execution environment of an existing task.
|
docs/clearml_agent/clearml_agent_dynamic_gpus.md (new file, 46 lines)
@@ -0,0 +1,46 @@
|
||||
---
|
||||
title: Dynamic GPU Allocation
|
||||
---
|
||||
:::important Enterprise Feature
|
||||
This feature is available under the ClearML Enterprise plan
|
||||
:::
|
||||
|
||||
The ClearML Enterprise server supports dynamic allocation of GPUs based on queue properties.
|
||||
Agents can spin multiple Tasks from different queues based on the number of GPUs the queue
|
||||
needs.
|
||||
|
||||
The `--dynamic-gpus` flag enables dynamic allocation of GPUs based on queue properties.
|
||||
To configure the number of GPUs for a queue, use the `--gpus` flag to specify the active GPUs, and use the `--queue`
|
||||
flag to specify the queue name and number of GPUs:
|
||||
|
||||
```console
|
||||
clearml-agent daemon --dynamic-gpus --gpus 0-2 --queue dual_gpus=2 single_gpu=1
|
||||
```
|
||||
|
||||
## Example
|
||||
|
||||
Let's say a server has three queues:
|
||||
* `dual_gpu`
|
||||
* `quad_gpu`
|
||||
* `opportunistic`
|
||||
|
||||
An agent can be spun on multiple GPUs (for example: 8 GPUs, `--gpus 0-7`), and then attached to multiple
|
||||
queues that are configured to run with a certain amount of resources:
|
||||
|
||||
```console
|
||||
clearml-agent daemon --dynamic-gpus --gpus 0-7 --queue quad_gpu=4 dual_gpu=2
|
||||
```
|
||||
|
||||
The agent can now spin up multiple Tasks from the different queues based on the number of GPUs configured for each queue.
The agent will pick a Task from the `quad_gpu` queue, use GPUs 0-3, and run it. Then it will pick a Task from the `dual_gpu`
queue, look for available GPUs again, and run it on GPUs 4-5.
|
||||
|
||||
Another option for allocating GPUs:
|
||||
|
||||
```console
|
||||
clearml-agent daemon --dynamic-gpus --gpus 0-7 --queue dual=2 opportunistic=1-4
|
||||
```
|
||||
|
||||
Notice that a minimum and maximum value of GPUs is specified for the `opportunistic` queue. This means the agent
|
||||
will pull a Task from the `opportunistic` queue and allocate up to 4 GPUs based on availability (i.e. GPUs not currently
|
||||
being used by other agents).
|
docs/clearml_agent/clearml_agent_env_caching.md (new file, 33 lines)
@@ -0,0 +1,33 @@
|
||||
---
|
||||
title: Environment Caching
|
||||
---
|
||||
|
||||
ClearML Agent caches virtual environments so that when running experiments multiple times, there is no need to spend time
reinstalling the same packages. To make use of the cached virtual environments, enable the virtual environment reuse mechanism.
|
||||
|
||||
## Virtual Environment Reuse
|
||||
|
||||
The virtual environment reuse feature may reduce experiment startup time dramatically.
|
||||
|
||||
By default, ClearML uses the package manager's environment caching. This means that even if no
|
||||
new packages need to be installed, checking the list of packages can take a long time.
|
||||
|
||||
ClearML has a virtual environment reuse mechanism which, when enabled, allows using environments as-is without resolving
|
||||
installed packages. This means that when executing multiple experiments with the same package dependencies,
|
||||
the same environment will be used.
|
||||
|
||||
:::note
|
||||
ClearML does not support environment reuse when using Poetry package manager
|
||||
:::
|
||||
|
||||
To enable environment reuse, modify the `clearml.conf` file and unmark the `venvs_cache` section.
|
||||
```
|
||||
venvs_cache: {
|
||||
# maximum number of cached venvs
|
||||
max_entries: 10
|
||||
# minimum required free space to allow for cache entry, disable by passing 0 or negative value
|
||||
free_space_threshold_gb: 2.0
|
||||
# unmark to enable virtual environment caching
|
||||
# path: ~/.clearml/venvs-cache
|
||||
},
|
||||
```
|
@@ -6,7 +6,7 @@ This page lists the available environment variables for configuring ClearML Agent

 In addition to the environment variables listed below, ClearML also supports **dynamic environment variables** to override
 any configuration option that appears in the [`agent`](../configs/clearml_conf.md#agent-section) section of the `clearml.conf`.
-For more information, see [Dynamic Environment Variables](../clearml_agent.md#dynamic-environment-variables).
+For more information, see [Dynamic Environment Variables](../clearml_agent/clearml_agent_setup.md#dynamic-environment-variables).

 :::info
 ClearML's environment variables override the [clearml.conf file](../configs/clearml_conf.md), SDK, and
@@ -16,7 +16,7 @@ but can be overridden by command-line arguments.

 |Name| Description |
 |---|---|
-|**CLEARML_DOCKER_IMAGE** | Sets the default docker image to use when running an agent in [Docker mode](../clearml_agent.md#docker-mode) |
+|**CLEARML_DOCKER_IMAGE** | Sets the default docker image to use when running an agent in [Docker mode](../clearml_agent/clearml_agent_execution_env.md#docker-mode) |
 |**CLEARML_WORKER_NAME** | Sets the Worker's name |
 |**CLEARML_WORKER_ID** | Sets the Worker ID |
 |**CLEARML_CUDA_VERSION** | Sets the CUDA version to be used |
docs/clearml_agent/clearml_agent_execution_env.md (new file, 70 lines)
@@ -0,0 +1,70 @@
|
||||
---
|
||||
title: Execution Environments
|
||||
---
|
||||
ClearML Agent has two primary execution modes: [Virtual Environment Mode](#virtual-environment-mode) and [Docker Mode](#docker-mode).
|
||||
|
||||
## Virtual Environment Mode
|
||||
|
||||
In Virtual Environment Mode, the agent creates a virtual environment for the experiment, installs the required Python
|
||||
packages based on the task specification, clones the code repository, applies the uncommitted changes and finally
|
||||
executes the code while monitoring it. This mode uses smart caching so packages and environments can be reused over
|
||||
multiple tasks (see [Virtual Environment Reuse](clearml_agent_env_caching.md#virtual-environment-reuse)).
|
||||
|
||||
ClearML Agent supports working with one of the following package managers:
|
||||
* [`pip`](https://en.wikipedia.org/wiki/Pip_(package_manager)) (default)
|
||||
* [`conda`](https://docs.conda.io/en/latest/)
|
||||
* [`poetry`](https://python-poetry.org/)
|
||||
|
||||
To change the package manager used by the agent, edit the [`package_manager.type`](../configs/clearml_conf.md#agentpackage_manager)
field of the `clearml.conf`. If extra channels are needed for `conda`, add the missing channels in the
`package_manager.conda_channels` field in the `clearml.conf`.
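For example, a sketch of the relevant `clearml.conf` section when switching the agent to `conda` (the channel list shown
is illustrative):

```
agent {
    package_manager {
        # package manager used to create the task's execution environment
        type: conda

        # extra conda channels to search when installing packages
        conda_channels: ["pytorch", "conda-forge", "defaults"]
    }
}
```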
|
||||
|
||||
:::note Using Poetry with Pyenv
|
||||
Some versions of poetry (using `install-poetry.py`) do not respect `pyenv global`.
|
||||
If you are using pyenv to control the environment where you use ClearML Agent, you can:
|
||||
* Use poetry v1.2 and above (which fixes [this issue](https://github.com/python-poetry/poetry/issues/5077))
|
||||
* Install poetry with the deprecated `get-poetry.py` installer
|
||||
:::
|
||||
|
||||
## Docker Mode
|
||||
:::note notes
|
||||
* Docker Mode is only supported on Linux.
* Docker Mode requires Docker service v19.03 or higher to be installed.
|
||||
:::
|
||||
|
||||
When executing the ClearML Agent in Docker mode, it will:
|
||||
1. Run the provided Docker container
|
||||
1. Install ClearML Agent in the container
|
||||
1. Execute the Task in the container, and monitor the process.
|
||||
|
||||
ClearML Agent uses the provided default Docker container, which can be overridden from the UI.
|
||||
|
||||
:::tip Setting Docker Container via UI
|
||||
You can set the docker container via the UI:
|
||||
1. Clone the experiment
|
||||
2. Set the Docker in the cloned task's **Execution** tab **> Container** section
|
||||

|
||||
3. Enqueue the cloned task
|
||||
|
||||
The task will be executed in the container specified in the UI.
|
||||
:::
|
||||
|
||||
All ClearML Agent flags (such as `--gpus` and `--foreground`) are applicable to Docker mode as well.
|
||||
|
||||
* To execute ClearML Agent in Docker mode, run:
|
||||
```bash
|
||||
clearml-agent daemon --queue <execution_queue_to_pull_from> --docker [optional default docker image to use]
|
||||
```
|
||||
|
||||
* To use the current `clearml-agent` version in the Docker container, instead of the latest `clearml-agent` version that is
|
||||
automatically installed, pass the `--force-current-version` flag:
|
||||
```bash
|
||||
clearml-agent daemon --queue default --docker --force-current-version
|
||||
```
|
||||
|
||||
* For Kubernetes, specify a host mount on the daemon host. Do not use the host mount inside the Docker container.
|
||||
Set the environment variable `CLEARML_AGENT_K8S_HOST_MOUNT`.
|
||||
For example:
|
||||
```
|
||||
CLEARML_AGENT_K8S_HOST_MOUNT=/mnt/host/data:/root/.clearml
|
||||
```
|
docs/clearml_agent/clearml_agent_fractional_gpus.md (new file, 358 lines)
@@ -0,0 +1,358 @@
|
||||
---
|
||||
title: Fractional GPUs
|
||||
---
|
||||
Some tasks that you send for execution need a minimal amount of compute and memory, but you end up allocating entire
|
||||
GPUs to them. In order to optimize your compute resource usage, you can partition GPUs into slices. You can have a GPU
|
||||
device run multiple isolated workloads on separate slices that will not impact each other, and will only use the
|
||||
fraction of GPU memory allocated to them.
|
||||
|
||||
ClearML provides several GPU slicing options to optimize compute resource utilization:
|
||||
* [Container-based Memory Limits](#container-based-memory-limits): Use pre-packaged containers with built-in memory
|
||||
limits to run multiple containers on the same GPU (**Available as part of the ClearML open source offering**)
|
||||
* [Kubernetes-based Static MIG Slicing](#kubernetes-static-mig-fractions): Set up Kubernetes support for NVIDIA MIG
|
||||
(Multi-Instance GPU) to define GPU fractions for specific workloads (**Available as part of the ClearML open source offering**)
|
||||
* Dynamic GPU Slicing: On-demand GPU slicing per task for both MIG and non-MIG devices (**Available under the ClearML Enterprise plan**):
|
||||
* [Bare Metal deployment](#bare-metal-deployment)
|
||||
* [Kubernetes deployment](#kubernetes-deployment)
|
||||
|
||||
## Container-based Memory Limits
|
||||
Use [`clearml-fractional-gpu`](https://github.com/allegroai/clearml-fractional-gpu)'s pre-packaged containers with
|
||||
built-in hard memory limitations. Workloads running in these containers will only be able to use up to the container's
|
||||
memory limit. Multiple isolated workloads can run on the same GPU without impacting each other.
|
||||
|
||||

|
||||
|
||||
### Usage
|
||||
|
||||
#### Manual Execution
|
||||
|
||||
1. Choose the container with the appropriate memory limit. ClearML supports CUDA 11.x and CUDA 12.x with memory limits
|
||||
ranging from 2 GB to 12 GB (see [clearml-fractional-gpu repository](https://github.com/allegroai/clearml-fractional-gpu/blob/main/README.md#-containers) for full list).
|
||||
1. Launch the container:
|
||||
|
||||
```bash
|
||||
docker run -it --gpus 0 --ipc=host --pid=host clearml/fractional-gpu:u22-cu12.3-8gb bash
|
||||
```
|
||||
|
||||
This example runs the ClearML Ubuntu 22 with CUDA 12.3 container on GPU 0, which is limited to use up to 8GB of its memory.
|
||||
:::note
|
||||
`--pid=host` is required to allow the driver to differentiate between the container's processes and other host processes when limiting memory usage.
|
||||
:::
|
||||
1. Run the following command inside the container to verify that the fractional gpu memory limit is working correctly:
|
||||
```bash
|
||||
nvidia-smi
|
||||
```
|
||||
Here is the expected output for the previous example (with the 8GB limit) on an A100:
|
||||
```bash
|
||||
+---------------------------------------------------------------------------------------+
|
||||
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|
||||
|-----------------------------------------+----------------------+----------------------+
|
||||
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
|
||||
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
|
||||
| | | MIG M. |
|
||||
|=========================================+======================+======================|
|
||||
| 0 A100-PCIE-40GB Off | 00000000:01:00.0 Off | N/A |
|
||||
| 32% 33C P0 66W / 250W | 0MiB / 8128MiB | 3% Default |
|
||||
| | | Disabled |
|
||||
+-----------------------------------------+----------------------+----------------------+
|
||||
|
||||
+---------------------------------------------------------------------------------------+
|
||||
| Processes: |
|
||||
| GPU GI CI PID Type Process name GPU Memory |
|
||||
| ID ID Usage |
|
||||
|=======================================================================================|
|
||||
+---------------------------------------------------------------------------------------+
|
||||
```
|
||||
#### Remote Execution
|
||||
|
||||
You can set a ClearML Agent to execute tasks in a fractional GPU container. Set an agent’s default container via its
|
||||
command line. For example, all tasks pulled from the `default` queue by this agent will be executed in the Ubuntu 22
|
||||
with CUDA 12.3 container, which is limited to use up to 8GB of its memory:
|
||||
|
||||
```bash
|
||||
clearml-agent daemon --queue default --docker clearml/fractional-gpu:u22-cu12.3-8gb
|
||||
```
|
||||
|
||||
The agent’s default container can be overridden via the UI:
|
||||
1. Clone the task
|
||||
1. Set the Docker in the cloned task's **Execution** tab > **Container** section
|
||||
|
||||

|
||||
|
||||
1. Enqueue the cloned task
|
||||
|
||||
The task will be executed in the container specified in the UI.
|
||||
|
||||
For more information, see [Docker Mode](clearml_agent_execution_env.md#docker-mode).
|
||||
|
||||
#### Fractional GPU Containers on Kubernetes
|
||||
Fractional GPU containers can be used to limit the memory consumption of your Kubernetes Job/Pod, and have multiple
|
||||
containers share GPU devices without interfering with each other.
|
||||
|
||||
For example, the following configures a K8s pod to run using the `clearml/fractional-gpu:u22-cu12.3-8gb` container,
|
||||
which limits the pod to 8 GB of the GPU's memory:
|
||||
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-pod
  labels:
    app: trainme
spec:
  hostPID: true
  containers:
    - name: train-container
      image: clearml/fractional-gpu:u22-cu12.3-8gb
      command: ['python3', '-c', 'import torch; print(f"Free GPU Memory: (free, global) {torch.cuda.mem_get_info()}")']
```
|
||||
|
||||
:::note
|
||||
`hostPID: true` is required to allow the driver to differentiate between the pod's processes and other host processes
|
||||
when limiting memory usage.
|
||||
:::
|
||||
|
||||
### Custom Container
|
||||
Build your own custom fractional GPU container by inheriting from one of ClearML's containers: in your Dockerfile, make
sure to include `FROM <clearml_container_image>` so the container will inherit from the relevant container.
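For example, a minimal Dockerfile sketch (the base image tag and the installed packages are illustrative):

```dockerfile
# inherit the hard memory limit from the ClearML fractional GPU base image
FROM clearml/fractional-gpu:u22-cu12.3-8gb

# add your own dependencies on top of the base image
RUN pip3 install --no-cache-dir transformers datasets
```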
|
||||
|
||||
See example custom Dockerfiles in the [clearml-fractional-gpu repository](https://github.com/allegroai/clearml-fractional-gpu/tree/main/examples).
|
||||
|
||||
## Kubernetes Static MIG Fractions
|
||||
Set up NVIDIA MIG (Multi-Instance GPU) support for Kubernetes to define GPU fraction profiles for specific workloads
|
||||
through your NVIDIA device plugin.
|
||||
|
||||
The ClearML Agent Helm chart lets you specify a pod template for each queue which describes the resources that the pod
|
||||
will use. The template should specify the requested GPU slices under `containers.resources.limits` to have the pods use
|
||||
the defined resources. For example, the following configures a K8s pod to run a `3g.20gb` MIG device:
|
||||
```yaml
# tf-benchmarks-mixed.yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-benchmarks-mixed
spec:
  restartPolicy: Never
  containers:
    - name: tf-benchmarks-mixed
      image: ""
      command: []
      args: []
      resources:
        limits:
          nvidia.com/mig-3g.20gb: 1
  nodeSelector: #optional
    nvidia.com/gpu.product: A100-SXM4-40GB
```
|
||||
|
||||
When tasks are added to the relevant queue, the agent pulls the task and creates a pod to execute it, using the
|
||||
specified GPU slice.
|
||||
|
||||
For example, the following configures tasks from the default queue to use `1g.5gb` MIG slices:
|
||||
```
|
||||
agentk8sglue:
|
||||
queue: default
|
||||
# …
|
||||
basePodTemplate:
|
||||
# …
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 1
|
||||
nodeSelector:
|
||||
nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
|
||||
```
|
||||
|
||||
## Dynamic GPU Fractions
|
||||
|
||||
:::important Enterprise Feature
|
||||
Dynamic GPU slicing is available under the ClearML Enterprise plan.
|
||||
:::
|
||||
|
||||
ClearML dynamic GPU fractions provide on-the-fly, per task GPU slicing, without having to set up containers or
|
||||
pre-configure tasks with memory limits. Specify a GPU fraction for a queue in the agent invocation, and every task the
|
||||
agent pulls from the queue will run on a container with the specified limit. This way you can safely run multiple tasks
|
||||
simultaneously without worrying that one task will use all of the GPU's memory.
|
||||
|
||||
You can dynamically slice GPUs on [bare metal](#bare-metal-deployment) or on [Kubernetes](#kubernetes-deployment), for
|
||||
both MIG-enabled and non-MIG devices.
|
||||
|
||||
### Bare Metal Deployment
|
||||
1. Install the required packages:
|
||||
|
||||
```bash
|
||||
pip install clearml-agent clearml-agent-fractional-gpu
|
||||
```
|
||||
1. Start the ClearML agent with dynamic GPU allocation. Use `--gpus` to specify the active GPUs, and use the `--queue`
|
||||
flag to specify the queue name(s) and number (or fraction) of GPUs to allocate to them.
|
||||
|
||||
```
|
||||
clearml-agent daemon --dynamic-gpus --gpus 0,1 --queue half_gpu=0.5
|
||||
```
|
||||
|
||||
The agent can utilize 2 GPUs (GPUs 0 and 1). Every task enqueued to the `half_gpu` queue will be run by the agent and
|
||||
only allocated 50% GPU memory (i.e. 4 tasks can run concurrently).
|
||||
|
||||
:::note
|
||||
You can allocate GPUs for a queue’s tasks by specifying either a fraction of a single GPU in increments as small as 0.125
|
||||
(e.g. 0.125, 0.25, 0.50, etc.) or whole GPUs (e.g. 1, 2, 4, etc.). However, you cannot specify fractions greater than
|
||||
one GPU (e.g. 1.25).
|
||||
:::
|
||||
|
||||
You can set up multiple queues, each allocated a different number of GPUs per task. Note that the order that the queues
|
||||
are listed is their order of priority, so the agent will service tasks from the first listed queue before servicing
|
||||
subsequent queues:
|
||||
```
|
||||
clearml-agent daemon --dynamic-gpus --gpus 0-2 --queue dual_gpus=2 quarter_gpu=0.25 half_gpu=0.5 single_gpu=1
|
||||
```
|
||||
|
||||
This agent will utilize 3 GPUs (GPUs 0, 1, and 2). The agent can spin multiple jobs from the different queues based on
|
||||
the number of GPUs configured to the queue.
|
||||
|
||||
#### Example Workflow
|
||||
Let’s say that four tasks are enqueued, one task for each of the above queues (`dual_gpus`, `quarter_gpu`, `half_gpu`,
|
||||
`single_gpu`). The agent will first pull the task from the `dual_gpus` queue since it is listed first, and will run it
|
||||
using 2 GPUs. It will next run the tasks from `quarter_gpu` and `half_gpu`; both will run on the remaining available
GPU. This leaves the task in the `single_gpu` queue. Currently 2.75 GPUs out of the 3 are in use, so the task will only
|
||||
be pulled and run when enough GPUs become available.
|
||||
|
||||
### Kubernetes Deployment
|
||||
|
||||
ClearML supports fractional GPUs on Kubernetes through custom Enterprise Helm Charts for both MIG and non-MIG devices:
|
||||
* `clearml-dynamic-mig-operator` for [MIG devices](#mig-enabled-gpus)
|
||||
* `clearml-fractional-gpu-injector` for [non-MIG devices](#non-mig-devices)
|
||||
|
||||
For either setup, you can set up in your Enterprise ClearML Agent Helm chart the resources requirements of tasks sent to
|
||||
each queue. When a task is enqueued in ClearML, it translates into a Kubernetes pod running on the designated device
|
||||
with the specified fractional resource as defined in the Agent Helm chart.
|
||||
|
||||
#### MIG-enabled GPUs
|
||||
The **ClearML Dynamic MIG Operator** (CDMO) chart enables running AI workloads on K8s with optimized hardware utilization
|
||||
and workload performance by facilitating MIG GPU partitioning. Make sure you have a [MIG capable GPU](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus).
|
||||
|
||||
##### Prepare Cluster
|
||||
* Install the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator):
|
||||
|
||||
```
|
||||
helm repo add nvidia https://helm.ngc.nvidia.com
|
||||
helm repo update
|
||||
|
||||
helm install -n gpu-operator \
|
||||
gpu-operator \
|
||||
nvidia/gpu-operator \
|
||||
--create-namespace \
|
||||
--set migManager.enabled=false \
|
||||
--set mig.strategy=mixed
|
||||
```
|
||||
* Enable MIG support:
|
||||
1. Enable dynamic MIG support on your cluster by running the following command on all nodes used for training (run it for each GPU ID in your cluster):
|
||||
|
||||
```
|
||||
nvidia-smi -i <gpu_id> -mig 1
|
||||
```
|
||||
1. Reboot node if required.
|
||||
1. Add the following label to all nodes that will be used for training:
|
||||
|
||||
```
|
||||
kubectl label nodes <node-name> "cdmo.clear.ml/gpu-partitioning=mig"
|
||||
```
|
||||
|
||||
##### Configure ClearML Queues
|
||||
The ClearML Enterprise plan supports K8S servicing multiple ClearML queues, as well as providing a pod template for each
|
||||
queue for describing the resources for each pod to use.
|
||||
|
||||
In the `values.yaml` file, set the resource requirements of each ClearML queue. For example, the following configures
|
||||
what resources to use for the `default025` and the `default050` queues:
|
||||
```
|
||||
agentk8sglue:
|
||||
queues:
|
||||
default025:
|
||||
templateOverrides:
|
||||
labels:
|
||||
required-resources: "0.25"
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/mig-1g.10gb: 1
|
||||
default050:
|
||||
templateOverrides:
|
||||
labels:
|
||||
required-resources: "0.50"
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/mig-1g.10gb: 1
|
||||
```
|
||||
|
||||
#### Non-MIG Devices
|
||||
The **Fractional GPU Injector** chart enables running AI workloads on k8s in an optimized way, allowing you to use
|
||||
fractional GPUs on non-MIG devices.
|
||||
|
||||
##### Requirements
|
||||
Install the [Nvidia GPU Operator](https://github.com/NVIDIA/gpu-operator) through the Helm chart. Make sure `timeSlicing`
|
||||
is enabled.
|
||||
|
||||
For example:
|
||||
```
|
||||
devicePlugin:
|
||||
config:
|
||||
name: device-plugin-config
|
||||
create: true
|
||||
default: "any"
|
||||
data:
|
||||
any: |-
|
||||
version: v1
|
||||
flags:
|
||||
migStrategy: none
|
||||
sharing:
|
||||
timeSlicing:
|
||||
renameByDefault: false
|
||||
failRequestsGreaterThanOne: false
|
||||
resources:
|
||||
- name: nvidia.com/gpu
|
||||
replicas: 4
|
||||
```
|
||||
|
||||
The number of replicas is the maximum number of slices on a GPU.
|
||||
|
||||
##### Configure ClearML Queues
|
||||
In the `values.yaml` file, set the resource requirements of each ClearML queue. When a task is enqueued to the queue,
|
||||
it translates into a Kubernetes pod running on the designated device with the specified resource slice. The queues must
|
||||
be configured with specific labels and annotations. For example, the following configures the `default0500` queue to use
|
||||
50% of a GPU and the `default0250` queue to use 25% of a GPU:
|
||||
```
|
||||
agentk8sglue:
|
||||
queues:
|
||||
default0500:
|
||||
templateOverrides:
|
||||
labels:
|
||||
required-resources: "0.5"
|
||||
clearml-injector/fraction: "0.500"
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 1
|
||||
clear.ml/fraction-1: "0.5"
|
||||
queueSettings:
|
||||
maxPods: 10
|
||||
default0250:
|
||||
templateOverrides:
|
||||
labels:
|
||||
required-resources: "0.25"
|
||||
clearml-injector/fraction: "0.250"
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 1
|
||||
clear.ml/fraction-1: "0.25"
|
||||
queueSettings:
|
||||
maxPods: 10
|
||||
```
|
||||
If a pod has a label matching the pattern `clearml-injector/fraction: "<gpu_fraction_value>"`, the injector will
|
||||
configure that pod to utilize the specified fraction of the GPU:
|
||||
```
|
||||
labels:
|
||||
clearml-injector/fraction: "<gpu_fraction_value>"
|
||||
```
|
||||
Where `<gpu_fraction_value>` must be set to one of the following values:
|
||||
* "0.125"
|
||||
* "0.250"
|
||||
* "0.375"
|
||||
* "0.500"
|
||||
* "0.625"
|
||||
* "0.750"
|
||||
* "0.875"
|
docs/clearml_agent/clearml_agent_google_colab.md (new file, 8 lines)
@@ -0,0 +1,8 @@
|
||||
---
|
||||
title: Google Colab
|
||||
---
|
||||
|
||||
ClearML Agent can run on a [Google Colab](https://colab.research.google.com/) instance. This lets users leverage
compute resources provided by Google Colab and send experiments for execution on it.
|
||||
|
||||
Check out [this tutorial](../guides/ide/google_colab.md) on how to run a ClearML Agent on Google Colab!
|
docs/clearml_agent/clearml_agent_scheduling.md (new file, 120 lines)
@@ -0,0 +1,120 @@
|
||||
---
|
||||
title: Scheduling Working Hours
|
||||
---
|
||||
:::important Enterprise Feature
|
||||
This feature is available under the ClearML Enterprise plan
|
||||
:::
|
||||
|
||||
The Agent scheduler enables scheduling working hours for each Agent. During working hours, a worker will actively poll
|
||||
queues for Tasks, fetch and execute them. Outside working hours, a worker will be idle.
|
||||
|
||||
Schedule workers by:
|
||||
|
||||
* Setting configuration file options
|
||||
* Running `clearml-agent` from the command line (overrides configuration file options)
|
||||
|
||||
Override worker schedules by:
|
||||
|
||||
* Setting runtime properties to force a worker on or off
|
||||
* Tagging a queue on or off
|
||||
|
||||
## Running clearml-agent with a Schedule (Command Line)
|
||||
|
||||
Set a schedule for a worker from the command line when running `clearml-agent`. Two properties enable setting working hours:
|
||||
|
||||
:::warning
|
||||
Use only one of these properties
|
||||
:::
|
||||
|
||||
* `uptime` - Time span during which a worker will actively poll a queue(s) for Tasks, and execute them. Outside this
|
||||
time span, the worker will be idle.
|
||||
* `downtime` - Time span during which a worker will be idle. Outside this time span, the worker will actively poll and
|
||||
execute Tasks.
|
||||
|
||||
Define `uptime` or `downtime` as `"<hours> <days>"`, where:
|
||||
|
||||
* `<hours>` - A span of hours (`00-23`) or a single hour. A single hour defines a span from that hour to midnight.
|
||||
* `<days>` - A span of days (`SUN-SAT`) or a single day.
|
||||
|
||||
Use `-` for a span, and `,` to separate individual values. To span before midnight to after midnight, use two spans.
|
||||
|
||||
For example:
|
||||
|
||||
* `"20-23 SUN"` - 8 PM to 11 PM on Sundays.
|
||||
* `"20-23 SUN,TUE"` - 8 PM to 11 PM on Sundays and Tuesdays.
|
||||
* `"20-23 SUN-TUE"` - 8 PM to 11 PM on Sundays, Mondays, and Tuesdays.
|
||||
* `"20 SUN"` - 8 PM to midnight on Sundays.
|
||||
* `"20-00,00-08 SUN"` - 8 PM to midnight and midnight to 8 AM on Sundays
|
||||
* `"20-00 SUN", "00-08 MON"` - 8 PM on Sundays to 8 AM on Mondays (spans from before midnight to after midnight).
|
||||
|
||||
## Setting Worker Schedules in the Configuration File
|
||||
|
||||
Set a schedule for a worker using configuration file options. The options are:
|
||||
|
||||
:::warning
|
||||
Use only one of these properties
|
||||
:::
|
||||
|
||||
* ``agent.uptime``
|
||||
* ``agent.downtime``
|
||||
|
||||
Use the same time span format for days and hours as is used in the command line.
|
||||
|
||||
For example, set a worker's schedule from 5 PM to 8 PM on Sunday through Tuesday, and 1 PM to 10 PM on Wednesday.
|
||||
|
||||
```
|
||||
agent.uptime: ["17-20 SUN-TUE", "13-22 WED"]
|
||||
```
|
||||
|
||||
## Overriding Worker Schedules Using Runtime Properties
|
||||
|
||||
Runtime properties override the command line uptime / downtime properties. The runtime properties are:
|
||||
|
||||
:::warning
|
||||
Use only one of these properties
|
||||
:::
|
||||
|
||||
* `force:on` - Pull and execute Tasks until the property expires.
|
||||
* `force:off` - Prevent pulling and execution of Tasks until the property expires.
|
||||
|
||||
Currently, these runtime properties can only be set using a ClearML REST API call to the `workers.set_runtime_properties`
|
||||
endpoint, as follows:
|
||||
|
||||
* The body of the request must contain the `worker-id`, and the runtime property to add.
|
||||
* An expiry date is optional. Use the format `"expiry":<time>`. For example, `"expiry":86400` will set an expiry of 24 hours.
|
||||
* To delete the property, set the expiry date to zero, `"expiry":0`.
|
||||
|
||||
For example, to force a worker on for 24 hours:
|
||||
|
||||
```
|
||||
curl --user <key>:<secret> --header "Content-Type: application/json" --data '{"worker":"<worker_id>","runtime_properties":[{"key": "force", "value": "on", "expiry": 86400}]}' http://<api-server-hostname-or-ip>:8008/workers.set_runtime_properties
|
||||
```
|
||||
|
||||
## Overriding Worker Schedules Using Queue Tags
|
||||
|
||||
Queue tags override command line and runtime properties. The queue tags are the following:
|
||||
|
||||
:::warning
|
||||
Use only one of these properties
|
||||
:::
|
||||
|
||||
* ``force_workers:on`` - Any worker listening to the queue will keep pulling Tasks from the queue.
|
||||
* ``force_workers:off`` - Prevent all workers listening to the queue from pulling Tasks from the queue.
|
||||
|
||||
Currently, you can set queue tags using a ClearML REST API call to the ``queues.update`` endpoint, or the
|
||||
APIClient. The body of the call must contain the ``queue-id`` and the tags to add.
|
||||
|
||||
For example, force workers on for a queue using the APIClient:
|
||||
|
||||
```python
|
||||
from clearml.backend_api.session.client import APIClient
|
||||
|
||||
client = APIClient()
|
||||
client.queues.update(queue="<queue_id>", tags=["force_workers:on"])
|
||||
```
|
||||
|
||||
Or, force workers on for a queue using the REST API:
|
||||
|
||||
```bash
|
||||
curl --user <key>:<secret> --header "Content-Type: application/json" --data '{"queue":"<queue_id>","tags":["force_workers:on"]}' http://<api-server-hostname-or-ip>:8008/queues.update
|
||||
```
|
docs/clearml_agent/clearml_agent_services_mode.md (new file, 38 lines)
@@ -0,0 +1,38 @@
|
||||
---
|
||||
title: Services Mode
|
||||
---
|
||||
ClearML Agent supports a **Services Mode** where, as soon as a task is launched from its queue, the agent moves on to the
|
||||
next task without waiting for the previous one to complete. This mode is intended for running resource-sparse tasks that
|
||||
are usually idling, such as periodic cleanup services or a [pipeline controller](../references/sdk/automation_controller_pipelinecontroller.md).
|
||||
|
||||
To run a `clearml-agent` in services mode, run:
|
||||
```bash
|
||||
clearml-agent daemon --services-mode --queue services --create-queue --docker <docker_name> --cpu-only
|
||||
```
|
||||
|
||||
To limit the number of simultaneous tasks run in services mode, pass the maximum number immediately after the
|
||||
`--services-mode` option (for example: `--services-mode 5`).
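For example, to cap the agent at five concurrent service tasks:

```bash
clearml-agent daemon --services-mode 5 --queue services --create-queue --docker <docker_name> --cpu-only
```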
|
||||
|
||||
:::note Notes
|
||||
* `services-mode` currently only supports Docker mode. Each service is spun up in its own Docker container.
|
||||
* The default `clearml-server` configuration already runs a single `clearml-agent` in services mode that listens to the
|
||||
`services` queue.
|
||||
:::
|
||||
|
||||
Launch a service task like any other task, by enqueuing it to the appropriate queue.
|
||||
|
||||
:::warning
|
||||
Do not enqueue training or inference tasks into the services queue. They will put an unnecessary load on the server.
|
||||
:::
|
||||
|
||||
## Setting Server Credentials
|
||||
|
||||
Self-hosted [ClearML Server](../deploying_clearml/clearml_server.md) comes by default with a services queue.
|
||||
By default, the server is open and does not require username and password, but it can be [password-protected](../deploying_clearml/clearml_server_security.md#user-access-security).
|
||||
In case it is password-protected, the services agent will need to be configured with server credentials (associated with a user).
|
||||
|
||||
To do that, set these environment variables on the ClearML Server machine with the appropriate credentials:
|
||||
```
|
||||
CLEARML_API_ACCESS_KEY
|
||||
CLEARML_API_SECRET_KEY
|
||||
```
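For example, a sketch of launching the services agent with credentials passed through the environment (the key values
are placeholders):

```bash
CLEARML_API_ACCESS_KEY=<access_key> CLEARML_API_SECRET_KEY=<secret_key> \
  clearml-agent daemon --services-mode --queue services --docker <docker_name> --cpu-only
```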
|
docs/clearml_agent/clearml_agent_setup.md (new file, 163 lines)
@@ -0,0 +1,163 @@
|
||||
---
|
||||
title: Setup
|
||||
---
|
||||
|
||||
## Installation
|
||||
|
||||
:::note
|
||||
If ClearML was previously configured, follow [this](#adding-clearml-agent-to-a-configuration-file) to add
|
||||
ClearML Agent specific configurations
|
||||
:::
|
||||
|
||||
To install ClearML Agent, execute
|
||||
```bash
|
||||
pip install clearml-agent
|
||||
```
|
||||
|
||||
:::info
|
||||
Install ClearML Agent as a system Python package and not in a Python virtual environment.
|
||||
An agent that runs in Virtual Environment Mode or Conda Environment Mode needs to create virtual environments, and
|
||||
it can't do that when running from a virtual environment.
|
||||
:::
|
||||
|
||||
## Configuration
|
||||
|
||||
1. In a terminal session, execute
|
||||
```bash
|
||||
clearml-agent init
|
||||
```
|
||||
|
||||
The setup wizard prompts for ClearML credentials (see [here](../webapp/webapp_profile.md#clearml-credentials) about obtaining credentials).
|
||||
```
|
||||
Please create new clearml credentials through the settings page in your `clearml-server` web app,
|
||||
or create a free account at https://app.clear.ml/settings/webapp-configuration
|
||||
|
||||
In the settings > workspace page, press "Create new credentials", then press "Copy to clipboard".
|
||||
|
||||
Paste copied configuration here:
|
||||
```
|
||||
|
||||
If the setup wizard's response indicates that a configuration file already exists, follow the instructions [here](#adding-clearml-agent-to-a-configuration-file).
|
||||
The wizard does not edit or overwrite existing configuration files.
|
||||
|
||||
1. At the command prompt `Paste copied configuration here:`, copy and paste the ClearML credentials and press **Enter**.
|
||||
The setup wizard confirms the credentials.
|
||||
|
||||
```
|
||||
Detected credentials key="********************" secret="*******"
|
||||
```
|
||||
|
||||
1. Press **Enter** to accept the default server URL, which is detected from the credentials, or enter a ClearML web server URL.
|
||||
|
||||
A secure protocol, https, must be used. **Do not use http.**
|
||||
|
||||
```
|
||||
WEB Host configured to: [https://app.clear.ml]
|
||||
```
|
||||
|
||||
:::note
|
||||
If you are using a self-hosted ClearML Server, the default URL will use your domain.
|
||||
:::
|
||||
|
||||
1. Do the same for the API server and file server URLs.
|
||||
|
||||
1. The wizard responds with your configuration:
|
||||
```
|
||||
CLEARML Hosts configuration:
|
||||
Web App: https://app.clear.ml
|
||||
API: https://api.clear.ml
|
||||
File Store: https://files.clear.ml
|
||||
|
||||
Verifying credentials ...
|
||||
Credentials verified!
|
||||
```
|
||||
|
||||
1. Enter your Git username and password. Leave blank for SSH key authentication or when only using public repositories.
|
||||
|
||||
This is needed for cloning repositories by the agent.
|
||||
```
|
||||
Enter git username for repository cloning (leave blank for SSH key authentication): []
|
||||
Enter password for user '<username>':
|
||||
```
|
||||
The setup wizard confirms your git credentials.
|
||||
```
|
||||
Git repository cloning will be using user=<username> password=<password>
|
||||
```
|
||||
1. Enter an additional artifact repository, or press **Enter** if not required.
|
||||
|
||||
This is needed for installing Python packages not found in PyPI.
|
||||
|
||||
```
|
||||
Enter additional artifact repository (extra-index-url) to use when installing python packages (leave blank if not required):
|
||||
```
|
||||
The setup wizard completes.
|
||||
|
||||
```
|
||||
New configuration stored in /home/<username>/clearml.conf
|
||||
CLEARML-AGENT setup completed successfully.
|
||||
```
|
||||
|
||||
The configuration file location depends upon the operating system:
|
||||
|
||||
* Linux - `~/clearml.conf`
|
||||
* Mac - `$HOME/clearml.conf`
|
||||
* Windows - `\Users\<username>\clearml.conf`
|
||||
|
||||
1. Optionally, configure ClearML options for **ClearML Agent** (default docker, package manager, etc.). See the [ClearML Configuration Reference](../configs/clearml_conf.md)
|
||||
and the [ClearML Agent Environment Variables reference](../clearml_agent/clearml_agent_env_var.md).
|
||||
|
||||
:::note
|
||||
The ClearML Enterprise server provides a [configuration vault](../webapp/webapp_profile.md#configuration-vault), the contents
|
||||
of which are categorically applied on top of the agent-local configuration
|
||||
:::
|
||||
|
||||
|
||||
### Adding ClearML Agent to a Configuration File
|
||||
|
||||
In case a `clearml.conf` file already exists, add a few ClearML Agent-specific configurations to it.<br/>
|
||||
|
||||
**Adding ClearML Agent to a ClearML configuration file:**
|
||||
|
||||
1. Open the ClearML configuration file for editing. Depending upon the operating system, it is:
|
||||
* Linux - `~/clearml.conf`
|
||||
* Mac - `$HOME/clearml.conf`
|
||||
* Windows - `\Users\<username>\clearml.conf`
|
||||
|
||||
1. After the `api` section, add your `agent` section. For example:
|
||||
```
|
||||
agent {
|
||||
# Set GIT user/pass credentials (if user/pass are set, GIT protocol will be set to https)
|
||||
git_user=""
|
||||
git_pass=""
|
||||
# all other domains will use public access (no user/pass). Default: always send user/pass for any VCS domain
|
||||
git_host=""
|
||||
|
||||
# Force GIT protocol to use SSH regardless of the git url (Assumes GIT user/pass are blank)
|
||||
force_git_ssh_protocol: false
|
||||
|
||||
# unique name of this worker, if None, created based on hostname:process_id
|
||||
# Overridden with os environment: CLEARML_WORKER_NAME
|
||||
worker_id: ""
|
||||
}
|
||||
```
|
||||
View a complete ClearML Agent configuration file sample including an `agent` section [here](https://github.com/allegroai/clearml-agent/blob/master/docs/clearml.conf).
|
||||
|
||||
1. Save the configuration.
|
||||
|
||||
### Dynamic Environment Variables
|
||||
Dynamic ClearML Agent environment variables can be used to override any configuration setting that appears in the [`agent`](../configs/clearml_conf.md#agent-section)
|
||||
section of the `clearml.conf`.
|
||||
|
||||
The environment variable's name should be `CLEARML_AGENT__AGENT__<configuration-path>`, where `<configuration-path>`
|
||||
represents the full path to the configuration field being set. Elements of the configuration path should be separated by
|
||||
`__` (double underscore). For example, set the `CLEARML_AGENT__AGENT__DEFAULT_DOCKER__IMAGE` environment variable to
deploy an agent with a different value from what is specified for `agent.default_docker.image` in the clearml.conf.
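For example, a sketch of overriding the default Docker image through a dynamic environment variable (the image name is
illustrative):

```bash
# overrides agent.default_docker.image from clearml.conf for this agent only
export CLEARML_AGENT__AGENT__DEFAULT_DOCKER__IMAGE="nvidia/cuda:12.3.2-runtime-ubuntu22.04"
clearml-agent daemon --queue default --docker
```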
|
||||
|
||||
:::note NOTES
|
||||
* Since configuration fields may contain JSON-parsable values, make sure to always quote strings (otherwise the agent
|
||||
might fail to parse them)
|
||||
* To comply with environment variables standards, it is recommended to use only upper-case characters in
|
||||
environment variable keys. For this reason, ClearML Agent will always convert the configuration path specified in the
|
||||
dynamic environment variable's key to lower-case before overriding configuration values with the environment variable
|
||||
value.
|
||||
:::

@ -71,7 +71,7 @@ execute the tasks in the GPU queue.

#### Docker
Every task a cloud instance pulls will be run inside a docker container. When setting up an autoscaler app instance,
you can specify a default container to run the tasks inside. If the task has its own container configured, it will
override the autoscaler’s default docker image (see [Base Docker Image](../clearml_agent.md#base-docker-container)).
override the autoscaler’s default docker image (see [Base Docker Image](../clearml_agent/clearml_agent_docker.md#base-docker-container)).

#### Git Configuration
If your code is saved in a private repository, you can add your Git credentials so the ClearML Agents running on your

@ -482,7 +482,7 @@ match_rules: [

**`agent.package_manager.use_conda_base_env`** (*bool*)

* When set to `True`, installation will be performed into the base Conda environment. Use in [Docker mode](../clearml_agent.md#docker-mode).
* When set to `True`, installation will be performed into the base Conda environment. Use in [Docker mode](../clearml_agent/clearml_agent_execution_env.md#docker-mode).

___

@ -20,7 +20,7 @@ but can be overridden by command-line arguments.

|**CLEARML_LOG_ENVIRONMENT** | List of Environment variable names. These environment variables will be logged in the ClearML task's configuration hyperparameters `Environment` section. When executed by a ClearML agent, these values will be set in the task's execution environment. |
|**CLEARML_TASK_NO_REUSE** | Boolean. <br/> When set to `1`, a new task is created for every execution (see Task [reuse](../clearml_sdk/task_sdk.md#task-reuse)). |
|**CLEARML_CACHE_DIR** | Set the path for the ClearML cache directory, where ClearML stores all downloaded content. |
|**CLEARML_DOCKER_IMAGE** | Sets the default docker image to use when running an agent in [Docker mode](../clearml_agent.md#docker-mode). |
|**CLEARML_DOCKER_IMAGE** | Sets the default docker image to use when running an agent in [Docker mode](../clearml_agent/clearml_agent_execution_env.md#docker-mode). |
|**CLEARML_LOG_LEVEL** | Sets the ClearML package's log verbosity. Log levels adhere to [Python log levels](https://docs.python.org/3/library/logging.config.html#configuration-file-format): CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET |
|**CLEARML_SUPPRESS_UPDATE_MESSAGE** | Boolean. <br/> When set to `1`, suppresses new ClearML package version availability message. |
|**CLEARML_DEFAULT_OUTPUT_URI** | The default output destination for model checkpoints (snapshots) and artifacts. |
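
As a rough sketch, these variables are simply exported in the shell (or injected by whatever launches the process) before ClearML starts; the values below are illustrative only:

```bash
export CLEARML_CACHE_DIR=/mnt/clearml_cache          # relocate the download cache
export CLEARML_DOCKER_IMAGE="python:3.10-bullseye"   # default image for agents running in Docker mode
export CLEARML_LOG_LEVEL=INFO                        # one of the Python log levels listed above
```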

@ -29,7 +29,7 @@ Use the ClearML Web UI to:

For detailed information about the ClearML Web UI, see [User Interface](../webapp/webapp_overview.md).

ClearML Server also comes with a [services agent](../clearml_agent.md#services-mode) preinstalled.
ClearML Server also comes with a [services agent](../clearml_agent/clearml_agent_services_mode.md) preinstalled.

## Deployment

@ -20,7 +20,7 @@ The agent also supports overriding parameter values on-the-fly without code modi

ClearML [Hyperparameter Optimization](hpo.md) is implemented).

An agent can be associated with specific GPUs, enabling workload distribution. For example, on a machine with 8 GPUs you
can allocate several GPUs to an agent and use the rest for a different workload, even through another agent (see [Dynamic GPU Allocation](../clearml_agent.md#dynamic-gpu-allocation)).
can allocate several GPUs to an agent and use the rest for a different workload, even through another agent (see [Dynamic GPU Allocation](../clearml_agent/clearml_agent_dynamic_gpus.md)).

@ -81,7 +81,7 @@ The Agent supports the following running modes:

* **Virtual Environment Mode** - The agent creates a new virtual environment for the experiment, installs the required
python packages based on the Task specification, clones the code repository, applies the uncommitted changes and
finally executes the code while monitoring it. This mode uses smart caching so packages and environments can be reused
over multiple tasks (see [Virtual Environment Reuse](../clearml_agent.md#virtual-environment-reuse)).
over multiple tasks (see [Virtual Environment Reuse](../clearml_agent/clearml_agent_env_caching.md#virtual-environment-reuse)).

ClearML Agent supports using the following package managers: `pip` (default), `conda`, `poetry`.
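
A hedged sketch of switching the package manager for a single agent, assuming the `agent.package_manager.type` field in `clearml.conf` and the `CLEARML_AGENT__AGENT__<configuration-path>` dynamic environment variable convention covered above:

```bash
# Value is one of: pip (default), conda, poetry
export CLEARML_AGENT__AGENT__PACKAGE_MANAGER__TYPE="poetry"
clearml-agent daemon --queue default
```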

@ -47,7 +47,7 @@ that you need.

accessed, [compared](../../webapp/webapp_exp_comparing.md) and [tracked](../../webapp/webapp_exp_track_visual.md).
- [ClearML Agent](../../clearml_agent.md) does the heavy lifting. It reproduces the execution environment, clones your code,
applies code patches, manages parameters (including overriding them on the fly), executes the code, and queues multiple tasks.
It can even [build](../../clearml_agent.md#exporting-a-task-into-a-standalone-docker-container) the docker container for you!
It can even [build](../../clearml_agent/clearml_agent_docker.md#exporting-a-task-into-a-standalone-docker-container) the docker container for you!
- [ClearML Pipelines](../../pipelines/pipelines.md) ensure that steps run in the same order,
programmatically chaining tasks together, while giving an overview of the execution pipeline's status.

@ -44,7 +44,7 @@ pip install clearml

CLEARML_CONFIG_FILE = MyOtherClearML.conf
```

For more information about running experiments inside Docker containers, see [ClearML Agent Deployment](../../clearml_agent.md#deployment)
For more information about running experiments inside Docker containers, see [ClearML Agent Deployment](../../clearml_agent/clearml_agent_deployment.md)
and [ClearML Agent Reference](../../clearml_agent/clearml_agent_ref.md).

</Collapsible>

@ -53,8 +53,8 @@ required python packages, and execute and monitor the process.

(or even multiple queues), but only a single agent will pull a Task to be executed.

:::tip Agent Deployment Modes
ClearML Agents can be deployed in Virtual Environment Mode or Docker Mode. In [virtual environment mode](../../clearml_agent.md#execution-environments),
the agent creates a new venv to execute an experiment. In [Docker mode](../../clearml_agent.md#docker-mode),
ClearML Agents can be deployed in Virtual Environment Mode or Docker Mode. In [virtual environment mode](../../clearml_agent/clearml_agent_execution_env.md),
the agent creates a new venv to execute an experiment. In [Docker mode](../../clearml_agent/clearml_agent_execution_env.md#docker-mode),
the agent executes an experiment inside a Docker container. For more information, see [Running Modes](../../fundamentals/agents_and_queues.md#running-modes).
:::
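
As a quick sketch of launching an agent in each mode (the container image is illustrative; any image with Python available works):

```bash
# Virtual environment mode (default): the agent builds a venv per task
clearml-agent daemon --queue default

# Docker mode: each task runs inside the given container
clearml-agent daemon --queue default --docker nvidia/cuda:11.8.0-runtime-ubuntu22.04
```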

@ -8,7 +8,7 @@ on a remote or local machine, from a remote repository and your local machine.

### Prerequisites

- [`clearml`](../../getting_started/ds/ds_first_steps.md) Python package installed and configured
- [`clearml-agent`](../../clearml_agent.md#installation) running on at least one machine (to execute the experiment), configured to listen to `default` queue
- [`clearml-agent`](../../clearml_agent/clearml_agent_setup.md#installation) running on at least one machine (to execute the experiment), configured to listen to `default` queue

### Executing Code from a Remote Repository

@ -8,7 +8,7 @@ run, will automatically execute the [keras_tensorboard.py](https://github.com/al
script.

## Prerequisites
* [`clearml-agent`](../../clearml_agent.md#installation) installed and configured
* [`clearml-agent`](../../clearml_agent/clearml_agent_setup.md#installation) installed and configured
* [`clearml`](../../getting_started/ds/ds_first_steps.md#install-clearml) installed and configured
* [clearml](https://github.com/allegroai/clearml) repo cloned (`git clone https://github.com/allegroai/clearml.git`)

@ -10,7 +10,7 @@ A use case for this would be manual hyperparameter optimization, where a base ta
be used when running optimization tasks.

## Prerequisites
* [`clearml-agent`](../../clearml_agent.md#installation) installed and configured
* [`clearml-agent`](../../clearml_agent/clearml_agent_setup.md#installation) installed and configured
* [`clearml`](../../getting_started/ds/ds_first_steps.md#install-clearml) installed and configured
* [clearml](https://github.com/allegroai/clearml) repo cloned (`git clone https://github.com/allegroai/clearml.git`)

@ -66,7 +66,7 @@ Make use of the container you've just built by having a ClearML agent make use o
of the new Docker image, `new_docker`. See [Tuning Experiments](../../webapp/webapp_exp_tuning.md) for more task
modification options.
1. Enqueue the cloned experiment to the `default` queue.
1. Launch a `clearml-agent` in [Docker Mode](../../clearml_agent.md#docker-mode) and assign it to the `default` queue:
1. Launch a `clearml-agent` in [Docker Mode](../../clearml_agent/clearml_agent_execution_env.md#docker-mode) and assign it to the `default` queue:
   ```console
   clearml-agent daemon --docker --queue default
   ```

@ -9,7 +9,7 @@ where a `clearml-agent` will run and spin an instance of the remote session.

## Prerequisites

* `clearml-session` package installed (`pip install clearml-session`)
* At least one `clearml-agent` running on a **remote** host. See [installation details](../../clearml_agent.md#installation).
* At least one `clearml-agent` running on a **remote** host. See [installation details](../../clearml_agent/clearml_agent_setup.md#installation).
Configure the `clearml-agent` to listen to the `default` queue (`clearml-agent daemon --queue default`)
* An SSH client installed on the machine being used. To verify, open a terminal and execute `ssh`; if no error is received,
it should be good to go.
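
With these prerequisites in place, a remote session is typically started from the local machine. A minimal sketch, assuming the agent listens on the `default` queue:

```bash
# Launches an interactive remote session through the default queue
clearml-session --queue default
```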

@ -197,7 +197,7 @@ object, setting the following optimization parameters:

## Running as a Service

The optimization can run as a service if the `run_as_service` argument is set to `true`. For more information about
running as a service, see [Services Mode](../../../clearml_agent.md#services-mode).
running as a service, see [Services Mode](../../../clearml_agent/clearml_agent_services_mode.md).

```python
# if we are running as a service, just enqueue ourselves into the services queue and let it run the optimization

@ -14,7 +14,7 @@ up new instances when there aren't enough to execute pending tasks.

Run the ClearML AWS autoscaler in one of these ways:
* Run the [aws_autoscaler.py](https://github.com/allegroai/clearml/blob/master/examples/services/aws-autoscaler/aws_autoscaler.py)
script locally
* Launch through your [`services` queue](../../clearml_agent.md#services-mode)
* Launch through your [`services` queue](../../clearml_agent/clearml_agent_services_mode.md)

:::note Default AMI
The autoscaler service uses by default the `NVIDIA Deep Learning AMI v20.11.0-46a68101-e56b-41cd-8e32-631ac6e5d02b` AMI.

@ -140,7 +140,7 @@ Execution log https://app.clear.ml/projects/142a598b5d234bebb37a57d692f5689f/exp
```

### Remote Execution
Using the `--remote` command line option will enqueue the autoscaler to your [`services` queue](../../clearml_agent.md#services-mode)
Using the `--remote` command line option will enqueue the autoscaler to your [`services` queue](../../clearml_agent/clearml_agent_services_mode.md)
once the configuration wizard is complete:

```bash

@ -162,7 +162,7 @@ page under **HYPERPARAMETERS > General**.

The task can be reused to launch another autoscaler instance: clone the task, then edit its parameters for the instance
types and budget configuration, and enqueue the task for execution (you'll typically want to use a ClearML Agent running
in [services mode](../../clearml_agent.md#services-mode) for such service tasks).
in [services mode](../../clearml_agent/clearml_agent_services_mode.md) for such service tasks).

### Console

@ -55,7 +55,7 @@ an `APIClient` object that establishes a session with the ClearML Server, and ac

The experiment's hyperparameters are explicitly logged to ClearML using the [`Task.connect`](../../references/sdk/task.md#connect)
method. View them in the WebApp, in the experiment's **CONFIGURATION** page under **HYPERPARAMETERS > General**.

The task can be reused. Clone the task, edit its parameters, and enqueue the task to run in ClearML Agent [services mode](../../clearml_agent.md#services-mode).
The task can be reused. Clone the task, edit its parameters, and enqueue the task to run in ClearML Agent [services mode](../../clearml_agent/clearml_agent_services_mode.md).

![](

@ -44,7 +44,7 @@ to your needs and enqueue for execution directly from the ClearML UI.

Run the monitoring service in one of these ways:
* Run locally
* Run in ClearML Agent [services mode](../../clearml_agent.md#services-mode)
* Run in ClearML Agent [services mode](../../clearml_agent/clearml_agent_services_mode.md)

To run the monitoring service:

@ -85,7 +85,7 @@ page under **HYPERPARAMETERS > Args**.

![](

The task can be reused to launch another monitor instance: clone the task, edit its parameters, and enqueue the task for
execution (you'll typically want to use a ClearML Agent running in [services mode](../../clearml_agent.md#services-mode)
execution (you'll typically want to use a ClearML Agent running in [services mode](../../clearml_agent/clearml_agent_services_mode.md)
for such service tasks).

## Console

@ -10,7 +10,7 @@ example script.

* Clone the [clearml](https://github.com/allegroai/clearml) repository.
* Install the [requirements](https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/requirements.txt)
for the TensorFlow examples.
* Have **ClearML Agent** [installed and configured](../../clearml_agent.md#installation).
* Have **ClearML Agent** [installed and configured](../../clearml_agent/clearml_agent_setup.md#installation).

## Step 1: Run the Experiment

@ -78,7 +78,7 @@ See [Explicit Reporting Tutorial](../guides/reporting/explicit_reporting.md).

## Remote Execution
ClearML logs all the information required to reproduce an experiment on a different machine (installed packages,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent) listens to designated queues and when a task is enqueued,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent.md) listens to designated queues and when a task is enqueued,
the agent pulls it, recreates its execution environment, and runs it, reporting its scalars, plots, etc. to the
experiment manager.

@ -104,7 +104,7 @@ with the new configuration on a remote machine:

* Edit the hyperparameters and/or other details
* Enqueue the task

The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent).
The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent.md).

### Executing a Task Remotely

@ -76,7 +76,7 @@ See [Explicit Reporting Tutorial](../guides/reporting/explicit_reporting.md).

## Remote Execution
ClearML logs all the information required to reproduce an experiment on a different machine (installed packages,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent) listens to designated queues and when a task is enqueued,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent.md) listens to designated queues and when a task is enqueued,
the agent pulls it, recreates its execution environment, and runs it, reporting its scalars, plots, etc. to the
experiment manager.

@ -102,7 +102,7 @@ with the new configuration on a remote machine:

* Edit the hyperparameters and/or other details
* Enqueue the task

The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent).
The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent.md).

### Executing a Task Remotely

@ -76,7 +76,7 @@ See [Explicit Reporting Tutorial](../guides/reporting/explicit_reporting.md).

## Remote Execution
ClearML logs all the information required to reproduce an experiment on a different machine (installed packages,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent) listens to designated queues and when a task is enqueued,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent.md) listens to designated queues and when a task is enqueued,
the agent pulls it, recreates its execution environment, and runs it, reporting its scalars, plots, etc. to the
experiment manager.

@ -102,7 +102,7 @@ with the new configuration on a remote machine:

* Edit the hyperparameters and/or other details
* Enqueue the task

The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent).
The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent.md).

### Executing a Task Remotely

@ -88,7 +88,7 @@ and debug samples, plots, and scalars logged to TensorBoard

## Remote Execution
ClearML logs all the information required to reproduce an experiment on a different machine (installed packages,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent) listens to designated queues and when a task is enqueued,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent.md) listens to designated queues and when a task is enqueued,
the agent pulls it, recreates its execution environment, and runs it, reporting its scalars, plots, etc. to the
experiment manager.

@ -114,7 +114,7 @@ with the new configuration on a remote machine:

* Edit the hyperparameters and/or other details
* Enqueue the task

The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent).
The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent.md).

### Executing a Task Remotely

@ -77,7 +77,7 @@ See [Explicit Reporting Tutorial](../guides/reporting/explicit_reporting.md).

## Remote Execution
ClearML logs all the information required to reproduce an experiment on a different machine (installed packages,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent) listens to designated queues and when a task is enqueued,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent.md) listens to designated queues and when a task is enqueued,
the agent pulls it, recreates its execution environment, and runs it, reporting its scalars, plots, etc. to the
experiment manager.

@ -103,7 +103,7 @@ with the new configuration on a remote machine:

* Edit the hyperparameters and/or other details
* Enqueue the task

The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent).
The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent.md).

### Executing a Task Remotely

@ -73,7 +73,7 @@ See [Explicit Reporting Tutorial](../guides/reporting/explicit_reporting.md).

## Remote Execution
ClearML logs all the information required to reproduce an experiment on a different machine (installed packages,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent) listens to designated queues and when a task is enqueued,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent.md) listens to designated queues and when a task is enqueued,
the agent pulls it, recreates its execution environment, and runs it, reporting its scalars, plots, etc. to the
experiment manager.

@ -99,7 +99,7 @@ with the new configuration on a remote machine:

* Edit the hyperparameters and/or other details
* Enqueue the task

The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent).
The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent.md).

### Executing a Task Remotely

@ -97,7 +97,7 @@ additional tools, like argparse, TensorBoard, and matplotlib:

## Remote Execution
ClearML logs all the information required to reproduce an experiment on a different machine (installed packages,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent) listens to designated queues and when a task is enqueued,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent.md) listens to designated queues and when a task is enqueued,
the agent pulls it, recreates its execution environment, and runs it, reporting its scalars, plots, etc. to the
experiment manager.

@ -123,7 +123,7 @@ with the new configuration on a remote machine:

* Edit the hyperparameters and/or other details
* Enqueue the task

The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent).
The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent.md).

### Executing a Task Remotely

@ -103,7 +103,7 @@ See [Explicit Reporting Tutorial](../guides/reporting/explicit_reporting.md).

## Remote Execution
ClearML logs all the information required to reproduce an experiment on a different machine (installed packages,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent) listens to designated queues and when a task is enqueued,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent.md) listens to designated queues and when a task is enqueued,
the agent pulls it, recreates its execution environment, and runs it, reporting its scalars, plots, etc. to the
experiment manager.

@ -129,7 +129,7 @@ with the new configuration on a remote machine:

* Edit the hyperparameters and/or other details
* Enqueue the task

The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent).
The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent.md).

### Executing a Task Remotely

@ -79,7 +79,7 @@ additional tools, like Matplotlib:

## Remote Execution
ClearML logs all the information required to reproduce an experiment on a different machine (installed packages,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent) listens to designated queues and when a task is enqueued,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent.md) listens to designated queues and when a task is enqueued,
the agent pulls it, recreates its execution environment, and runs it, reporting its scalars, plots, etc. to the
experiment manager.

@ -105,7 +105,7 @@ with the new configuration on a remote machine:

* Edit the hyperparameters and/or other details
* Enqueue the task

The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent).
The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent.md).

### Executing a Task Remotely

@ -90,7 +90,7 @@ ClearML's automatic logging of parameters defined using `absl.flags`

## Remote Execution
ClearML logs all the information required to reproduce an experiment on a different machine (installed packages,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent) listens to designated queues and when a task is enqueued,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent.md) listens to designated queues and when a task is enqueued,
the agent pulls it, recreates its execution environment, and runs it, reporting its scalars, plots, etc. to the
experiment manager.

@ -116,7 +116,7 @@ with the new configuration on a remote machine:

* Edit the hyperparameters and/or other details
* Enqueue the task

The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent).
The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent.md).

### Executing a Task Remotely

@ -104,7 +104,7 @@ additional tools, like Matplotlib and scikit-learn:

## Remote Execution
ClearML logs all the information required to reproduce an experiment on a different machine (installed packages,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent) listens to designated queues and when a task is enqueued,
uncommitted changes etc.). The [ClearML Agent](../clearml_agent.md) listens to designated queues and when a task is enqueued,
the agent pulls it, recreates its execution environment, and runs it, reporting its scalars, plots, etc. to the
experiment manager.

@ -130,7 +130,7 @@ with the new configuration on a remote machine:

* Edit the hyperparameters and/or other details
* Enqueue the task

The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent).
The ClearML Agent executing the task will use the new values to [override any hard coded values](../clearml_agent.md).

### Executing a Task Remotely

@ -16,7 +16,7 @@ meets resource needs:

* [ClearML Session CLI](apps/clearml_session.md) - Launch an interactive JupyterLab, VS Code, and SSH session on a remote machine:
  * Automatically store and sync your [interactive session workspace](apps/clearml_session.md#storing-and-synchronizing-workspace)
  * Replicate a previously executed experiment's execution environment and [interactively execute and debug](apps/clearml_session.md#starting-a-debugging-session) it on a remote session
  * Develop directly inside your Kubernetes pods ([see ClearML Agent](clearml_agent.md#kubernetes))
  * Develop directly inside your Kubernetes pods ([see ClearML Agent](clearml_agent/clearml_agent_deployment.md#kubernetes))
  * And more!
* GUI Applications (available under ClearML Enterprise Plan) - These apps provide local links to access JupyterLab or
VS Code on a remote machine over a secure and encrypted SSH connection, letting you use the IDE as if you're running

@ -38,7 +38,7 @@ For more information about how autoscalers work, see [Autoscalers Overview](../.

* **Workers Prefix** (optional) - A Prefix added to workers' names, associating them with this autoscaler
* **Polling Interval** (optional) - Time period in minutes at which the designated queue is polled for new tasks
* **Use docker mode** - If selected, tasks enqueued to the autoscaler will be executed by ClearML Agents running in
[Docker mode](../../clearml_agent.md#docker-mode)
[Docker mode](../../clearml_agent/clearml_agent_execution_env.md#docker-mode)
* **Base Docker Image** (optional) - Available when `Use docker mode` is selected: Default Docker image in which the
ClearML Agent will run. Provide an image stored in a Docker artifactory so instances can automatically fetch it
* **Compute Resources**

@ -39,7 +39,7 @@ For more information about how autoscalers work, see [Autoscalers Overview](../.

* Git User
* Git Password / Personal Access Token
* **Use docker mode** - If selected, tasks enqueued to the autoscaler will be executed by ClearML Agents running in
[Docker mode](../../clearml_agent.md#docker-mode)
[Docker mode](../../clearml_agent/clearml_agent_execution_env.md#docker-mode)
* **Base Docker Image** (optional) - Available when `Use docker mode` is selected. Default Docker image in which the ClearML Agent will run. Provide an image stored in a
Docker artifactory so VM instances can automatically fetch it
* **Compute Resources**

@ -88,7 +88,7 @@ using to set up an environment (`pip` or `conda`) are available. Select which re

### Container
The Container section lists the following information:
* Image - a pre-configured Docker that ClearML Agent will use to remotely execute this experiment (see [Building Docker containers](../clearml_agent.md#exporting-a-task-into-a-standalone-docker-container))
* Image - a pre-configured Docker that ClearML Agent will use to remotely execute this experiment (see [Building Docker containers](../clearml_agent/clearml_agent_docker.md))
* Arguments - add Docker arguments
* Setup shell script - a bash script to be executed inside the Docker before setting up the experiment's environment

@ -70,7 +70,7 @@ Select source code by changing any of the following:

#### Base Docker Image
Select a pre-configured Docker that **ClearML Agent** will use to remotely execute this experiment (see [Building Docker containers](../clearml_agent.md#exporting-a-task-into-a-standalone-docker-container)).
Select a pre-configured Docker that **ClearML Agent** will use to remotely execute this experiment (see [Building Docker containers](../clearml_agent/clearml_agent_docker.md)).

**To add, change, or delete a base Docker image:**

@ -36,7 +36,13 @@ module.exports = {

        {'ClearML Fundamentals': ['fundamentals/projects', 'fundamentals/task', 'fundamentals/hyperparameters', 'fundamentals/artifacts', 'fundamentals/logger', 'fundamentals/agents_and_queues',
            'fundamentals/hpo']},
        {'ClearML SDK': ['clearml_sdk/clearml_sdk', 'clearml_sdk/task_sdk', 'clearml_sdk/model_sdk', 'clearml_sdk/apiclient_sdk']},
        'clearml_agent',
        {'ClearML Agent':
            ['clearml_agent', 'clearml_agent/clearml_agent_setup', 'clearml_agent/clearml_agent_deployment',
             'clearml_agent/clearml_agent_execution_env', 'clearml_agent/clearml_agent_env_caching',
             'clearml_agent/clearml_agent_dynamic_gpus', 'clearml_agent/clearml_agent_fractional_gpus',
             'clearml_agent/clearml_agent_services_mode', 'clearml_agent/clearml_agent_docker',
             'clearml_agent/clearml_agent_google_colab', 'clearml_agent/clearml_agent_scheduling'
            ]},
        {'Cloud Autoscaling': [
            'cloud_autoscaling/autoscaling_overview',
            {'Autoscaler Apps': [