revital
2025-02-27 13:16:52 +02:00
153 changed files with 5096 additions and 1500 deletions


@@ -28,4 +28,4 @@ jobs:
# Runs a single command using the runners shell
- name: Run a one-line script
run: |
grep -r -Eoh "(https?://github.com/[a-zA-Z0-9./?=_%:-]*)" $GITHUB_WORKSPACE | sort -u | grep -v "://github.com/allegroai/clearml-docs" | xargs -n 1 sh -c 'curl --output /dev/null --silent --head --fail $0 || curl --output /dev/null --silent --head --fail --write-out "%{url_effective}: %{http_code}\n" $0'
grep -r -Eoh "(https?://github.com/[a-zA-Z0-9./?=_%:-]*)" $GITHUB_WORKSPACE | sort -u | grep -v "://github.com/clearml/clearml-docs" | xargs -n 1 sh -c 'curl --output /dev/null --silent --head --fail $0 || curl --output /dev/null --silent --head --fail --write-out "%{url_effective}: %{http_code}\n" $0'
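The one-liner above chains `grep`, `sort`, and `curl` to extract unique GitHub links and probe each one. As a rough sketch of the extraction-and-filter step, here is the same logic in Python; the regex and the excluded repo path are taken from the workflow, while the sample text is made up for illustration:

```python
import re

# Same pattern the workflow's grep uses: GitHub URLs, later excluding
# links that point back into the docs repo itself.
GITHUB_URL = re.compile(r"https?://github\.com/[a-zA-Z0-9./?=_%:-]*")
EXCLUDE = "://github.com/clearml/clearml-docs"

def extract_candidate_links(text):
    """Collect unique, sorted GitHub URLs, skipping self-referencing docs links."""
    urls = sorted(set(GITHUB_URL.findall(text)))
    return [u for u in urls if EXCLUDE not in u]

links = extract_candidate_links(
    "see https://github.com/clearml/clearml-agent and "
    "https://github.com/clearml/clearml-docs/blob/main/README.md"
)
```

Each surviving link would then be checked with a HEAD request, as the `curl --head --fail` portion of the pipeline does.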


@@ -25,7 +25,7 @@ jobs:
env:
INCOMING_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
with:
text: Link Checker failure in github.com/allegroai/clearl-docs
text: Link Checker failure in github.com/clearml/clearml-docs
blocks: |
[
{"type": "section", "text": {"type": "mrkdwn", "text": "Testing!"}}


@@ -34,7 +34,7 @@ of the optimization results in table and graph forms.
|`--objective-metric-sign`| Optimization target, whether to maximize or minimize the value of the objective metric specified. Possible values: "min", "max", "min_global", "max_global". For more information, see [Optimization Objective](#optimization-objective). |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--objective-metric-title`| Objective metric title to maximize/minimize (e.g. 'validation').|<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--optimization-time-limit`|The maximum time (minutes) for the optimization to run. The default is `None`, indicating no time limit.|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--optimizer-class`|The optimizer to use. Possible values are: OptimizerOptuna (default), OptimizerBOHB, GridSearch, RandomSearch. For more information, see [Supported Optimizers](../fundamentals/hpo.md#supported-optimizers). |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--optimizer-class`|The optimizer to use. Possible values are: OptimizerOptuna (default), OptimizerBOHB, GridSearch, RandomSearch. For more information, see [Supported Optimizers](../clearml_sdk/hpo_sdk.md#supported-optimizers). |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--params-search`|Parameters space for optimization. See more information in [Specifying the Parameter Space](#specifying-the-parameter-space). |<img src="/docs/latest/icons/ico-optional-yes.svg" alt="Yes" className="icon size-md center-md" />|
|`--params-override`|Additional parameters of the base task to override for this parameter search. Use the following JSON format for each parameter: `{"name": "param_name", "value": <new_value>}`. Windows users, see [JSON format note](#json_note).|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
|`--pool-period-min`|The time between two consecutive polls (minutes).|<img src="/docs/latest/icons/ico-optional-no.svg" alt="No" className="icon size-md center-md" />|
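The JSON format that `--params-override` expects can be sketched as follows; this is a minimal illustration, and the parameter name `Args/batch_size` is a hypothetical example, not taken from the table above:

```python
import json

def params_override_args(overrides):
    """Render {name: value} pairs in the JSON shape --params-override expects:
    one {"name": ..., "value": ...} object per parameter."""
    return [json.dumps({"name": k, "value": v}) for k, v in overrides.items()]

# Hypothetical parameter name, for illustration only.
args = params_override_args({"Args/batch_size": 64})
```

Each entry in `args` is a JSON string suitable for passing on the command line (Windows users should see the JSON format note referenced in the table).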


@@ -23,7 +23,7 @@ VS Code remote sessions use ports 8878 and 8898 respectively.
## Prerequisites
* `clearml` installed and configured. See [Getting Started](../getting_started/ds/ds_first_steps.md) for details.
* `clearml` installed and configured. See [ClearML Setup](../clearml_sdk/clearml_sdk_setup) for details.
* At least one `clearml-agent` running on a remote host. See [installation](../clearml_agent/clearml_agent_setup.md#installation) for details.
* An SSH client installed on your machine. To verify, open your terminal and execute `ssh`. If you did not receive an
error, you are good to go.
@@ -56,7 +56,7 @@ error, you are good to go.
1. The session Task is enqueued in the selected queue, and a ClearML Agent pulls and executes it. The agent downloads the appropriate IDE(s) and
launches them.
1. Once the agent finishes the initial setup of the interactive Task, the local `cleaml-session` connects to the host
1. Once the agent finishes the initial setup of the interactive Task, the local `clearml-session` connects to the host
machine via SSH, and tunnels both SSH and IDE over the SSH connection. If a container is specified, the
IDE environment runs inside it.
@@ -142,7 +142,7 @@ sessions:
maxServices: 20
```
For more information, see [Kubernetes](../clearml_agent/clearml_agent_deployment.md#kubernetes).
For more information, see [Kubernetes](../clearml_agent/clearml_agent_deployment_k8s.md).
### Installing Requirements


@@ -11,7 +11,7 @@ line arguments, Python module dependencies, and a requirements.txt file!
## What Is ClearML Task For?
* Launching off-the-shelf code on a remote machine with dedicated resources (e.g. GPU)
* Running [hyperparameter optimization](../fundamentals/hpo.md) on a codebase that is still not in ClearML
* Running [hyperparameter optimization](../getting_started/hpo.md) on a codebase that is still not in ClearML
* Creating a pipeline from an assortment of scripts that you need to turn into ClearML tasks
* Running some code on a remote machine, either using an on-prem cluster or on the cloud


@@ -9,7 +9,8 @@ See [Hyper-Datasets](../hyperdatasets/overview.md) for ClearML's advanced querya
The following are some recommendations for using ClearML Data.
![Dataset UI gif](../img/gif/dataset.gif)
![Dataset UI gif](../img/gif/dataset.gif#light-mode-only)
![Dataset UI gif](../img/gif/dataset_dark.gif#dark-mode-only)
## Versioning Datasets
@@ -25,7 +26,7 @@ version contents ready to be updated.
Organize the datasets according to use-cases and use tags. This makes managing multiple datasets and
accessing the most updated datasets for different use-cases easier.
Like any ClearML tasks, datasets can be organized into [projects (and subprojects)](../fundamentals/projects.md#creating-subprojects).
Like any ClearML tasks, datasets can be organized into [projects (and subprojects)](../fundamentals/projects.md#creating-projects-and-subprojects).
Additionally, when creating a dataset, tags can be applied to the dataset, which will make searching for the dataset easier.
Organizing your datasets into projects by use-case makes it easier to access the most recent dataset version for that use-case.
@@ -55,5 +56,5 @@ serves as a dataset's single point of truth, you can schedule a script which use
will update the dataset based on the modifications made to the folder. This way, there is no need to manually modify a dataset.
This functionality will also track the modifications made to a folder.
See the sync function with the [CLI](clearml_data_cli.md#sync) or [SDK](clearml_data_sdk.md#syncing-local-storage)
See the sync function with the [CLI](../clearml_data/clearml_data_cli.md#sync) or [SDK](../clearml_data/clearml_data_sdk.md#syncing-local-storage)
interface.
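The idea behind syncing can be sketched as a snapshot-and-diff over a folder's contents. This is a minimal illustration of the concept, not ClearML Data's actual implementation; the function names are made up:

```python
import hashlib
from pathlib import Path

def snapshot(folder):
    """Map relative file path -> content hash for every file under folder."""
    root = Path(folder)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }

def diff_snapshots(old, new):
    """Classify changes the way a sync step would: added / modified / removed."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    modified = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, modified, removed

# Toy snapshots standing in for two successive scans of a folder.
old = {"a.txt": "hash1", "b.txt": "hash2"}
new = {"b.txt": "hash3", "c.txt": "hash4"}
added, modified, removed = diff_snapshots(old, new)
```

A scheduled script would take a fresh snapshot, diff it against the previous dataset version, and record only the changes.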


@@ -24,12 +24,12 @@ During early stages of model development, while code is still being modified hea
These setups can be folded into each other and that's great! If you have a GPU machine for each researcher, that's awesome!
The goal of this phase is to get your code, dataset, and environment set up, so you can start digging to find the best model!
- [ClearML SDK](../../clearml_sdk/clearml_sdk.md) should be integrated into your code (check out [Getting Started](ds_first_steps.md)).
- [ClearML SDK](../clearml_sdk/clearml_sdk.md) should be integrated into your code (check out [ClearML Setup](../clearml_sdk/clearml_sdk_setup.md)).
This helps you visualize results and track progress.
- [ClearML Agent](../../clearml_agent.md) helps moving your work to other machines without the hassle of rebuilding the environment every time,
- [ClearML Agent](../clearml_agent.md) helps you move your work to other machines without the hassle of rebuilding the environment every time,
while also providing a simple queue interface that lets you drop your tasks to be executed one by one
(great for ensuring that the GPUs are churning during the weekend).
- [ClearML Session](../../apps/clearml_session.md) helps with developing on remote machines, in the same way that you'd develop on your local laptop!
- [ClearML Session](../apps/clearml_session.md) helps with developing on remote machines, in the same way that you'd develop on your local laptop!
## Train Remotely
@@ -43,12 +43,12 @@ yields the best performing model for your task!
Visualization and comparison dashboards keep insanity at bay! At this stage you usually have a docker container with all the binaries
that you need.
- [ClearML SDK](../../clearml_sdk/clearml_sdk.md) ensures that all the metrics, parameters and Models are automatically logged and can later be
accessed, [compared](../../webapp/webapp_exp_comparing.md) and [tracked](../../webapp/webapp_exp_track_visual.md).
- [ClearML Agent](../../clearml_agent.md) does the heavy lifting. It reproduces the execution environment, clones your code,
- [ClearML SDK](../clearml_sdk/clearml_sdk.md) ensures that all the metrics, parameters and Models are automatically logged and can later be
accessed, [compared](../webapp/webapp_exp_comparing.md) and [tracked](../webapp/webapp_exp_track_visual.md).
- [ClearML Agent](../clearml_agent.md) does the heavy lifting. It reproduces the execution environment, clones your code,
applies code patches, manages parameters (including overriding them on the fly), executes the code, and queues multiple tasks.
It can even [build](../../clearml_agent/clearml_agent_docker.md#exporting-a-task-into-a-standalone-docker-container) the docker container for you!
- [ClearML Pipelines](../../pipelines/pipelines.md) ensure that steps run in the same order,
It can even [build](../../clearml_agent/clearml_agent_docker_exec#exporting-a-task-into-a-standalone-docker-container) the docker container for you!
- [ClearML Pipelines](../pipelines/pipelines.md) ensure that steps run in the same order,
programmatically chaining tasks together, while giving an overview of the execution pipeline's status.
**Your entire environment should magically be able to run on any machine, without you working hard.**
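The ordering guarantee described for pipelines above can be sketched as a toy loop; this models only the chaining semantics, not the ClearML Pipelines API, and the step names are made up:

```python
def run_pipeline(steps):
    """Execute steps strictly in order, feeding each step the previous result,
    and record per-step status for an at-a-glance overview."""
    status, result = [], None
    for name, fn in steps:
        result = fn(result)
        status.append((name, "completed"))
    return result, status

# Two toy steps chained together: the second consumes the first's output.
result, status = run_pipeline([
    ("prepare", lambda _: [3, 1, 2]),
    ("train", lambda data: sorted(data)),
])
```

The status list is what gives the "overview of the execution pipeline's status" described above.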


@@ -7,10 +7,10 @@ From training models to data processing to deploying to production.
## Development - Preparing for Automation
Basically, track everything. There is nothing that is not worth having visibility into.
If you are afraid of clutter, use the archive option, and set up your own [cleanup service](../../guides/services/cleanup_service.md).
If you are afraid of clutter, use the archive option, and set up your own [cleanup service](../guides/services/cleanup_service.md).
- Track the code base. There is no reason not to add metrics to any process in your workflow, even if it is not directly ML. Visibility is key to iterative improvement of your code / workflow.
- Create per-project [leaderboards](../../guides/ui/building_leader_board.md) based on custom columns
- Create per-project [leaderboards](../guides/ui/building_leader_board.md) based on custom columns
(hyperparameters and performance accuracy), and bookmark them (full URL will always reproduce the same view and table).
- Share tasks with your colleagues and team-leaders.
Invite more people to see how your project is progressing, and suggest they add metric reporting for their own work.
@@ -18,23 +18,23 @@ If you are afraid of clutter, use the archive option, and set up your own [clean
## Clone Tasks
Define a ClearML Task with one of the following options:
- Run the actual code with the `Task.init()` call. This will create and auto-populate the Task in CleaML (including Git Repo / Python Packages / Command line etc.).
- Register local / remote code repository with `clearml-task`. See [details](../../apps/clearml_task.md).
- Run the actual code with the `Task.init()` call. This will create and auto-populate the Task in ClearML (including Git Repo / Python Packages / Command line etc.).
- Register local / remote code repository with `clearml-task`. See [details](../apps/clearml_task.md).
Once you have a Task in ClearML, you can clone and edit its definitions in the UI, then launch it on one of your nodes with [ClearML Agent](../../clearml_agent.md).
Once you have a Task in ClearML, you can clone and edit its definitions in the UI, then launch it on one of your nodes with [ClearML Agent](../clearml_agent.md).
## Advanced Automation
- Create daily / weekly cron jobs for retraining the best-performing models.
- Set up data monitoring and scheduling, and launch inference jobs to test performance on any incoming dataset.
- Once there are two or more tasks that run after another, group them together into a [pipeline](../../pipelines/pipelines.md).
- Once there are two or more tasks that run after another, group them together into a [pipeline](../pipelines/pipelines.md).
## Manage Your Data
Use [ClearML Data](../../clearml_data/clearml_data.md) to version your data, then link it to running tasks for easy reproduction.
Use [ClearML Data](../clearml_data/clearml_data.md) to version your data, then link it to running tasks for easy reproduction.
Make datasets machine agnostic (i.e. store the original dataset in a shared storage location, e.g. shared folder / S3 / GS / Azure).
ClearML Data supports efficient dataset storage and caching, with differential and compressed storage.
## Scale Your Work
Use [ClearML Agent](../../clearml_agent.md) to scale work. Install the agent machines (remote or local) and manage
Use [ClearML Agent](../clearml_agent.md) to scale work. Install the agent machines (remote or local) and manage
training workload with it.
Improve team collaboration with transparent resource monitoring; always know what is running where.


@@ -0,0 +1,18 @@
---
title: Build Interactive Model Demos
---
:::info Enterprise Feature
The Gradio Launcher and Streamlit Apps are available under the ClearML Enterprise plan.
:::
ClearML supports quickly creating web-based interfaces for AI models, making it easier to
test, demo, and iterate on new capabilities. With ClearML's built-in orchestration, you can easily launch, manage,
and optimize AI-powered applications, accelerating their path to production.
ClearML provides the following applications for building an interactive model interface:
* [Gradio Launcher](webapp/applications/apps_gradio.md)
* [Streamlit Launcher](webapp/applications/apps_streamlit.md)
![Gradio Dashboard](img/apps_gradio.png#light-mode-only)
![Gradio Dashboard](img/apps_gradio_dark.png#dark-mode-only)


@@ -60,9 +60,9 @@ original values:
* Code-level configuration instrumented with [`Task.connect()`](references/sdk/task.md#connect) will be overridden by modified hyperparameters
ClearML Agent can be deployed in various setups to suit different workflows and infrastructure needs:
* [Bare Metal](clearml_agent/clearml_agent_deployment.md#spinning-up-an-agent)
* [Kubernetes](clearml_agent/clearml_agent_deployment.md#kubernetes)
* [Slurm](clearml_agent/clearml_agent_deployment.md#slurm)
* [Bare Metal](clearml_agent/clearml_agent_deployment_bare_metal.md#spinning-up-an-agent)
* [Kubernetes](clearml_agent/clearml_agent_deployment_k8s.md)
* [Slurm](clearml_agent/clearml_agent_deployment_slurm.md)
* [Google Colab](guides/ide/google_colab.md)
## References


@@ -1,292 +0,0 @@
---
title: Deployment
---
## Spinning Up an Agent
You can spin up an agent on any machine: on-prem and/or cloud instance. When spinning up an agent, you assign it to
service a queue(s). Utilize the machine by enqueuing tasks to the queue that the agent is servicing, and the agent will
pull and execute the tasks.
:::tip cross-platform execution
ClearML Agent is platform-agnostic. When using the ClearML Agent to execute tasks cross-platform, set platform
specific environment variables before launching the agent.
For example, to run an agent on an ARM device, set the core type environment variable before spinning up the agent:
```bash
export OPENBLAS_CORETYPE=ARMV8
clearml-agent daemon --queue <queue_name>
```
:::
### Executing an Agent
To execute an agent, listening to a queue, run:
```bash
clearml-agent daemon --queue <queue_name>
```
### Executing in Background
To execute an agent in the background, run:
```bash
clearml-agent daemon --queue <execution_queue_to_pull_from> --detached
```
### Stopping Agents
To stop an agent running in the background, run:
```bash
clearml-agent daemon <arguments> --stop
```
### Allocating Resources
To specify GPUs associated with the agent, add the `--gpus` flag.
:::info Docker Mode
Make sure to include the `--docker` flag, as GPU management through the agent is only supported in [Docker Mode](clearml_agent_execution_env.md#docker-mode).
:::
To execute multiple agents on the same machine (usually assigning GPU for the different agents), run:
```bash
clearml-agent daemon --gpus 0 --queue default --docker
clearml-agent daemon --gpus 1 --queue default --docker
```
To allocate more than one GPU, provide a list of allocated GPUs
```bash
clearml-agent daemon --gpus 0,1 --queue dual_gpu --docker
```
### Queue Prioritization
A single agent can listen to multiple queues. The priority is set by their order.
```bash
clearml-agent daemon --queue high_q low_q
```
This ensures the agent first tries to pull a Task from the `high_q` queue, and only if it is empty, the agent will try to pull
from the `low_q` queue.
To make sure an agent pulls from all queues equally, add the `--order-fairness` flag.
```bash
clearml-agent daemon --queue group_a group_b --order-fairness
```
It will make sure the agent will pull from the `group_a` queue, then from `group_b`, then back to `group_a`, etc. This ensures
that `group_a` or `group_b` will not be able to starve one another of resources.
### SSH Access
By default, ClearML Agent maps the host's `~/.ssh` into the container's `/root/.ssh` directory (configurable,
see [clearml.conf](../configs/clearml_conf.md#docker_internal_mounts)).
If you want to use existing auth sockets with ssh-agent, you can verify your host ssh-agent is working correctly with:
```commandline
echo $SSH_AUTH_SOCK
```
You should see a path to a temporary file, something like this:
```console
/tmp/ssh-<random>/agent.<random>
```
Then run your `clearml-agent` in Docker mode, which will automatically detect the `SSH_AUTH_SOCK` environment variable,
and mount the socket into any container it spins.
You can also explicitly set the `SSH_AUTH_SOCK` environment variable when executing an agent. The command below will
execute an agent in Docker mode and assign it to service a queue. The agent will have access to
the SSH socket provided in the environment variable.
```
SSH_AUTH_SOCK=<file_socket> clearml-agent daemon --gpus <your config> --queue <your queue name> --docker
```
## Kubernetes
Agents can be deployed bare-metal or as Docker containers in a Kubernetes cluster. ClearML Agent adds missing scheduling capabilities to Kubernetes, enabling more flexible automation from code while leveraging all of ClearML Agent's features.
ClearML Agent is deployed onto a Kubernetes cluster using **Kubernetes-Glue**, which maps ClearML jobs directly to Kubernetes jobs. This allows seamless task execution and resource allocation across your cluster.
### Deployment Options
You can deploy ClearML Agent onto Kubernetes using one of the following methods:
1. **ClearML Agent Helm Chart (Recommended)**:
Use the [ClearML Agent Helm Chart](https://github.com/clearml/clearml-helm-charts/tree/main/charts/clearml-agent) to spin up an agent pod acting as a controller. This is the recommended and scalable approach.
2. **K8s Glue Script**:
Run a [K8s Glue script](https://github.com/clearml/clearml-agent/blob/master/examples/k8s_glue_example.py) on a Kubernetes CPU node. This approach is less scalable and typically suited for simpler use cases.
### How It Works
The ClearML Kubernetes-Glue performs the following:
- Pulls jobs from the ClearML execution queue.
- Prepares a Kubernetes job based on a provided YAML template.
- Inside each job pod, the `clearml-agent`:
- Installs the required environment for the task.
- Executes and monitors the task process.
:::important Enterprise Features
ClearML Enterprise adds advanced Kubernetes features:
- **Multi-Queue Support**: Service multiple ClearML queues within the same Kubernetes cluster.
- **Pod-Specific Templates**: Define resource configurations per queue using pod templates.
For example, you can configure resources for different queues as shown below:
```yaml
agentk8sglue:
queues:
example_queue_1:
templateOverrides:
nodeSelector:
nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
resources:
limits:
nvidia.com/gpu: 1
example_queue_2:
templateOverrides:
nodeSelector:
nvidia.com/gpu.product: A100-SXM4-40GB
resources:
limits:
nvidia.com/gpu: 2
```
:::
## Slurm
:::important Enterprise Feature
Slurm Glue is available under the ClearML Enterprise plan.
:::
Agents can be deployed bare-metal or inside [`Singularity`](https://docs.sylabs.io/guides/3.5/user-guide/introduction.html)
containers in Linux clusters managed with Slurm.
ClearML Agent Slurm Glue maps jobs to Slurm batch scripts: associate a ClearML queue to a batch script template, then
when a Task is pushed into the queue, it will be converted and executed as an `sbatch` job according to the sbatch
template specification attached to the queue.
1. Install the Slurm Glue on a machine where you can run `sbatch` / `squeue` etc.
```
pip3 install -U --extra-index-url https://*****@*****.allegro.ai/repository/clearml_agent_slurm/simple clearml-agent-slurm
```
1. Create a batch template. Make sure to set the `SBATCH` variables to the resources you want to attach to the queue.
The script below sets up an agent to run bare-metal, creating a virtual environment per job. For example:
```
#!/bin/bash
# available template variables (default value separator ":")
# ${CLEARML_QUEUE_NAME}
# ${CLEARML_QUEUE_ID}
# ${CLEARML_WORKER_ID}.
# complex template variables (default value separator ":")
# ${CLEARML_TASK.id}
# ${CLEARML_TASK.name}
# ${CLEARML_TASK.project.id}
# ${CLEARML_TASK.hyperparams.properties.user_key.value}
# example
#SBATCH --job-name=clearml_task_${CLEARML_TASK.id} # Job name DO NOT CHANGE
#SBATCH --ntasks=1 # Run on a single CPU
# #SBATCH --mem=1mb # Job memory request
# #SBATCH --time=00:05:00 # Time limit hrs:min:sec
#SBATCH --output=task-${CLEARML_TASK.id}-%j.log
#SBATCH --partition debug
#SBATCH --cpus-per-task=1
#SBATCH --priority=5
#SBATCH --nodes=${CLEARML_TASK.hyperparams.properties.num_nodes.value:1}
${CLEARML_PRE_SETUP}
echo whoami $(whoami)
${CLEARML_AGENT_EXECUTE}
${CLEARML_POST_SETUP}
```
Notice: If you are using Slurm with Singularity container support replace `${CLEARML_AGENT_EXECUTE}` in the batch
template with `singularity exec ${CLEARML_AGENT_EXECUTE}`. For additional required settings, see [Slurm with Singularity](#slurm-with-singularity).
:::tip
You can override the default values of a Slurm job template via the ClearML Web UI. The following command in the
template sets the `nodes` value to be the ClearML Tasks `num_nodes` user property:
```
#SBATCH --nodes=${CLEARML_TASK.hyperparams.properties.num_nodes.value:1}
```
This user property can be modified in the UI, in the task's **CONFIGURATION > User Properties** section, and when the
task is executed the new modified value will be used.
:::
3. Launch the ClearML Agent Slurm Glue and assign the Slurm configuration to a ClearML queue. For example, the following
associates the `default` queue to the `slurm.example.template` script, so any jobs pushed to this queue will use the
resources set by that script.
```
clearml-agent-slurm --template-files slurm.example.template --queue default
```
You can also pass multiple templates and queues. For example:
```
clearml-agent-slurm --template-files slurm.template1 slurm.template2 --queue queue1 queue2
```
### Slurm with Singularity
If you are running Slurm with Singularity containers support, set the following:
1. Make sure your `sbatch` template contains:
```
singularity exec ${CLEARML_AGENT_EXECUTE}
```
Additional singularity arguments can be added, for example:
```
singularity exec --uts ${CLEARML_AGENT_EXECUTE}`
```
1. Set the default Singularity container to use in your [clearml.conf](../configs/clearml_conf.md) file:
```
agent.default_docker.image="shub://repo/hello-world"
```
Or
```
agent.default_docker.image="docker://ubuntu"
```
1. Add `--singularity-mode` to the command line, for example:
```
clearml-agent-slurm --singularity-mode --template-files slurm.example_singularity.template --queue default
```
## Google Colab
ClearML Agent can run on a [Google Colab](https://colab.research.google.com/) instance. This helps users to leverage
compute resources provided by Google Colab and send tasks for execution on it.
Check out [this tutorial](../guides/ide/google_colab.md) on how to run a ClearML Agent on Google Colab!
## Explicit Task Execution
ClearML Agent can also execute specific tasks directly, without listening to a queue.
### Execute a Task without Queue
Execute a Task with a `clearml-agent` worker without a queue.
```bash
clearml-agent execute --id <task-id>
```
### Clone a Task and Execute the Cloned Task
Clone the specified Task and execute the cloned Task with a `clearml-agent` worker without a queue.
```bash
clearml-agent execute --id <task-id> --clone
```
### Execute Task inside a Docker
Execute a Task with a `clearml-agent` worker using a Docker container without a queue.
```bash
clearml-agent execute --id <task-id> --docker
```
## Debugging
Run a `clearml-agent` daemon in foreground mode, sending all output to the console.
```bash
clearml-agent daemon --queue default --foreground
```


@@ -0,0 +1,136 @@
---
title: Manual Deployment
---
## Spinning Up an Agent
You can spin up an agent on any machine: on-prem and/or cloud instance. When spinning up an agent, you assign it to
service one or more queues. Utilize the machine by enqueuing tasks to a queue that the agent is servicing, and the agent will
pull and execute the tasks.
:::tip cross-platform execution
ClearML Agent is platform-agnostic. When using the ClearML Agent to execute tasks cross-platform, set platform
specific environment variables before launching the agent.
For example, to run an agent on an ARM device, set the core type environment variable before spinning up the agent:
```bash
export OPENBLAS_CORETYPE=ARMV8
clearml-agent daemon --queue <queue_name>
```
:::
### Executing an Agent
To execute an agent that listens to a queue, run:
```bash
clearml-agent daemon --queue <queue_name>
```
### Executing in Background
To execute an agent in the background, run:
```bash
clearml-agent daemon --queue <execution_queue_to_pull_from> --detached
```
### Stopping Agents
To stop an agent running in the background, run:
```bash
clearml-agent daemon <arguments> --stop
```
### Allocating Resources
To specify GPUs associated with the agent, add the `--gpus` flag.
:::info Docker Mode
Make sure to include the `--docker` flag, as GPU management through the agent is only supported in [Docker Mode](clearml_agent_execution_env.md#docker-mode).
:::
To execute multiple agents on the same machine (usually assigning a different GPU to each agent), run:
```bash
clearml-agent daemon --gpus 0 --queue default --docker
clearml-agent daemon --gpus 1 --queue default --docker
```
To allocate more than one GPU, provide a list of GPUs:
```bash
clearml-agent daemon --gpus 0,1 --queue dual_gpu --docker
```
### Queue Prioritization
A single agent can listen to multiple queues. Queue priority is set by the order in which the queues are listed.
```bash
clearml-agent daemon --queue high_q low_q
```
This ensures the agent first tries to pull a Task from the `high_q` queue, and only if it is empty will the agent pull
from the `low_q` queue.
To make sure an agent pulls from all queues equally, add the `--order-fairness` flag.
```bash
clearml-agent daemon --queue group_a group_b --order-fairness
```
This makes the agent pull from the `group_a` queue, then from `group_b`, then back to `group_a`, and so on. This ensures
that `group_a` and `group_b` will not be able to starve one another of resources.
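The two pulling strategies above can be modeled as a toy sketch; this illustrates only the ordering semantics, not the agent's internals:

```python
from collections import deque
from itertools import cycle

def pull_priority(queues):
    """Default ordering: always drain the first non-empty queue in the list."""
    order = []
    while any(queues.values()):
        for name, q in queues.items():
            if q:
                order.append((name, q.popleft()))
                break
    return order

def pull_fair(queues):
    """--order-fairness: round-robin across queues, one task at a time."""
    order = []
    names = cycle(list(queues))
    while any(queues.values()):
        name = next(names)
        if queues[name]:
            order.append((name, queues[name].popleft()))
    return order

# high_q is fully drained before low_q is touched.
priority_order = pull_priority({"high_q": deque([1, 2]), "low_q": deque([3])})
# With fairness, the two queues alternate.
fair_order = pull_fair({"group_a": deque(["a1", "a2"]), "group_b": deque(["b1", "b2"])})
```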
### SSH Access
By default, ClearML Agent maps the host's `~/.ssh` into the container's `/root/.ssh` directory (configurable,
see [clearml.conf](../configs/clearml_conf.md#docker_internal_mounts)).
If you want to use existing auth sockets with ssh-agent, you can verify your host ssh-agent is working correctly with:
```commandline
echo $SSH_AUTH_SOCK
```
You should see a path to a temporary file, something like this:
```console
/tmp/ssh-<random>/agent.<random>
```
Then run your `clearml-agent` in Docker mode, which will automatically detect the `SSH_AUTH_SOCK` environment variable
and mount the socket into any container it spins up.
You can also explicitly set the `SSH_AUTH_SOCK` environment variable when executing an agent. The command below will
execute an agent in Docker mode and assign it to service a queue. The agent will have access to
the SSH socket provided in the environment variable.
```
SSH_AUTH_SOCK=<file_socket> clearml-agent daemon --gpus <your config> --queue <your queue name> --docker
```
## Google Colab
ClearML Agent can run on a [Google Colab](https://colab.research.google.com/) instance. This lets you leverage
Google Colab's compute resources by sending tasks for execution on the instance.
Check out [this tutorial](../guides/ide/google_colab.md) on how to run a ClearML Agent on Google Colab!
## Explicit Task Execution
ClearML Agent can also execute specific tasks directly, without listening to a queue.
### Execute a Task without Queue
Execute a Task with a `clearml-agent` worker without a queue.
```bash
clearml-agent execute --id <task-id>
```
### Clone a Task and Execute the Cloned Task
Clone the specified Task and execute the cloned Task with a `clearml-agent` worker without a queue.
```bash
clearml-agent execute --id <task-id> --clone
```
### Execute Task inside a Docker
Execute a Task with a `clearml-agent` worker using a Docker container without a queue.
```bash
clearml-agent execute --id <task-id> --docker
```
## Debugging
Run a `clearml-agent` daemon in foreground mode, sending all output to the console.
```bash
clearml-agent daemon --queue default --foreground
```


@@ -0,0 +1,51 @@
---
title: Kubernetes
---
Agents can be deployed bare-metal or as Docker containers in a Kubernetes cluster. ClearML Agent adds missing scheduling capabilities to Kubernetes, enabling more flexible automation from code while leveraging all of ClearML Agent's features.
ClearML Agent is deployed onto a Kubernetes cluster using **Kubernetes-Glue**, which maps ClearML jobs directly to Kubernetes jobs. This allows seamless task execution and resource allocation across your cluster.
## Deployment Options
You can deploy ClearML Agent onto Kubernetes using one of the following methods:
1. **ClearML Agent Helm Chart (Recommended)**:
Use the [ClearML Agent Helm Chart](https://github.com/clearml/clearml-helm-charts/tree/main/charts/clearml-agent) to spin up an agent pod acting as a controller. This is the recommended and scalable approach.
2. **K8s Glue Script**:
Run a [K8s Glue script](https://github.com/clearml/clearml-agent/blob/master/examples/k8s_glue_example.py) on a Kubernetes CPU node. This approach is less scalable and typically suited for simpler use cases.
## How It Works
The ClearML Kubernetes-Glue performs the following:
- Pulls jobs from the ClearML execution queue.
- Prepares a Kubernetes job based on a provided YAML template.
- Inside each job pod, the `clearml-agent`:
- Installs the required environment for the task.
- Executes and monitors the task process.
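The glue loop described above can be sketched in a few lines. This is a conceptual illustration only: the queue and "cluster" are stand-ins, and the `clearml-id-<task-id>` job-naming scheme is hypothetical:

```python
import copy
import queue

def run_glue_once(task_queue, job_template, submitted):
    """One glue iteration: pull a task ID from the queue, render a job from the
    YAML-like template, and hand it to the cluster (modeled here as a list)."""
    task_id = task_queue.get_nowait()
    job = copy.deepcopy(job_template)  # never mutate the shared template
    job["metadata"] = {"name": f"clearml-id-{task_id}"}  # hypothetical naming
    job["spec"]["containers"][0]["env"] = [
        {"name": "CLEARML_TASK_ID", "value": task_id}
    ]
    submitted.append(job)

pending = queue.Queue()
pending.put("abc123")
template = {"metadata": {}, "spec": {"containers": [{"image": "clearml/agent"}]}}
submitted = []
run_glue_once(pending, template, submitted)
```

Inside the resulting pod, the `clearml-agent` would then install the task's environment and execute it, as described above.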
:::important Enterprise Features
ClearML Enterprise adds advanced Kubernetes features:
- **Multi-Queue Support**: Service multiple ClearML queues within the same Kubernetes cluster.
- **Pod-Specific Templates**: Define resource configurations per queue using pod templates.
For example, you can configure resources for different queues as shown below:
```yaml
agentk8sglue:
queues:
example_queue_1:
templateOverrides:
nodeSelector:
nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb
resources:
limits:
nvidia.com/gpu: 1
example_queue_2:
templateOverrides:
nodeSelector:
nvidia.com/gpu.product: A100-SXM4-40GB
resources:
limits:
nvidia.com/gpu: 2
```
:::
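The flow above can be sketched in plain Python. This is an illustrative sketch only (names, dict layout, and labels are assumptions, not the actual Kubernetes-Glue implementation, which also injects credentials, environment variables, and the per-queue template overrides):

```python
import copy

def render_k8s_job(task_id: str, queue_name: str, pod_template: dict) -> dict:
    """Render a Kubernetes job spec for a ClearML task pulled from a queue.

    Hypothetical sketch of the glue flow: copy the queue's pod template,
    name the job after the task, and have the pod run clearml-agent,
    which installs the task's environment, then executes and monitors it.
    """
    job = copy.deepcopy(pod_template)  # never mutate the shared template
    job["metadata"] = {
        "name": f"clearml-id-{task_id}",
        "labels": {"clearml-queue": queue_name},
    }
    job["spec"]["containers"][0]["args"] = ["clearml-agent", "execute", "--id", task_id]
    return job

template = {"spec": {"containers": [{"image": "python:3.11", "args": []}]}}
job = render_k8s_job("abc123", "default", template)
```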


@@ -0,0 +1,107 @@
---
title: Slurm
---
:::important Enterprise Feature
Slurm Glue is available under the ClearML Enterprise plan.
:::
Agents can be deployed bare-metal or inside [`Singularity`](https://docs.sylabs.io/guides/3.5/user-guide/introduction.html)
containers in Linux clusters managed with Slurm.
ClearML Agent Slurm Glue maps jobs to Slurm batch scripts: associate a ClearML queue to a batch script template, then
when a Task is pushed into the queue, it will be converted and executed as an `sbatch` job according to the sbatch
template specification attached to the queue.
1. Install the Slurm Glue on a machine where you can run `sbatch` / `squeue` etc.
```
pip3 install -U --extra-index-url https://*****@*****.allegro.ai/repository/clearml_agent_slurm/simple clearml-agent-slurm
```
1. Create a batch template. Make sure to set the `SBATCH` variables to the resources you want to attach to the queue.
The script below sets up an agent to run bare-metal, creating a virtual environment per job. For example:
```
#!/bin/bash
# available template variables (default value separator ":")
# ${CLEARML_QUEUE_NAME}
# ${CLEARML_QUEUE_ID}
# ${CLEARML_WORKER_ID}.
# complex template variables (default value separator ":")
# ${CLEARML_TASK.id}
# ${CLEARML_TASK.name}
# ${CLEARML_TASK.project.id}
# ${CLEARML_TASK.hyperparams.properties.user_key.value}
# example
#SBATCH --job-name=clearml_task_${CLEARML_TASK.id} # Job name DO NOT CHANGE
#SBATCH --ntasks=1 # Run on a single CPU
# #SBATCH --mem=1mb # Job memory request
# #SBATCH --time=00:05:00 # Time limit hrs:min:sec
#SBATCH --output=task-${CLEARML_TASK.id}-%j.log
#SBATCH --partition debug
#SBATCH --cpus-per-task=1
#SBATCH --priority=5
#SBATCH --nodes=${CLEARML_TASK.hyperparams.properties.num_nodes.value:1}
${CLEARML_PRE_SETUP}
echo whoami $(whoami)
${CLEARML_AGENT_EXECUTE}
${CLEARML_POST_SETUP}
```
Notice: If you are using Slurm with Singularity container support, replace `${CLEARML_AGENT_EXECUTE}` in the batch
template with `singularity exec ${CLEARML_AGENT_EXECUTE}`. For additional required settings, see [Slurm with Singularity](#slurm-with-singularity).
:::tip
You can override the default values of a Slurm job template via the ClearML Web UI. The following command in the
template sets the `nodes` value to be the ClearML Task's `num_nodes` user property:
```
#SBATCH --nodes=${CLEARML_TASK.hyperparams.properties.num_nodes.value:1}
```
This user property can be modified in the UI, in the task's **CONFIGURATION > User Properties** section, and when the
task is executed, the modified value will be used.
:::
3. Launch the ClearML Agent Slurm Glue and assign the Slurm configuration to a ClearML queue. For example, the following
associates the `default` queue to the `slurm.example.template` script, so any jobs pushed to this queue will use the
resources set by that script.
```
clearml-agent-slurm --template-files slurm.example.template --queue default
```
You can also pass multiple templates and queues. For example:
```
clearml-agent-slurm --template-files slurm.template1 slurm.template2 --queue queue1 queue2
```
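Templates and queues are paired positionally, so in the example above `slurm.template1` serves `queue1` and `slurm.template2` serves `queue2`. Conceptually:

```python
# Positional pairing of templates to queues (as in the CLI example above)
templates = ["slurm.template1", "slurm.template2"]
queues = ["queue1", "queue2"]
queue_to_template = dict(zip(queues, templates))
```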
## Slurm with Singularity
If you are running Slurm with Singularity container support, set the following:
1. Make sure your `sbatch` template contains:
```
singularity exec ${CLEARML_AGENT_EXECUTE}
```
Additional singularity arguments can be added, for example:
```
singularity exec --uts ${CLEARML_AGENT_EXECUTE}
```
1. Set the default Singularity container to use in your [clearml.conf](../configs/clearml_conf.md) file:
```
agent.default_docker.image="shub://repo/hello-world"
```
Or
```
agent.default_docker.image="docker://ubuntu"
```
1. Add `--singularity-mode` to the command line, for example:
```
clearml-agent-slurm --singularity-mode --template-files slurm.example_singularity.template --queue default
```


@@ -21,7 +21,7 @@ clearml-agent daemon --dynamic-gpus --gpus 0-2 --queue dual_gpus=2 single_gpu=1
Make sure to include the `--docker` flag, as dynamic GPU allocation is only supported in [Docker Mode](clearml_agent_execution_env.md#docker-mode).
:::
## Example
#### Example
Let's say a server has three queues:
* `dual_gpu`


@@ -9,7 +9,7 @@ If ClearML was previously configured, follow [this](#adding-clearml-agent-to-a-c
ClearML Agent specific configurations
:::
To install ClearML Agent, execute
To install [ClearML Agent](../clearml_agent.md), execute
```bash
pip install clearml-agent
```
@@ -27,7 +27,7 @@ it can't do that when running from a virtual environment.
clearml-agent init
```
The setup wizard prompts for ClearML credentials (see [here](../webapp/settings/webapp_settings_profile.md#clearml-credentials) about obtaining credentials).
The setup wizard prompts for ClearML credentials (see [here](../webapp/settings/webapp_settings_profile.md#clearml-api-credentials) about obtaining credentials).
```
Please create new clearml credentials through the settings page in your `clearml-server` web app,
or create a free account at https://app.clear.ml/settings/webapp-configuration


@@ -37,7 +37,7 @@ lineage and content information. See [dataset UI](../webapp/datasets/webapp_data
## Setup
`clearml-data` comes built-in with the `clearml` Python package! Check out the [Getting Started](../getting_started/ds/ds_first_steps.md)
`clearml-data` comes built-in with the `clearml` Python package! Check out the [ClearML Setup](../clearml_sdk/clearml_sdk_setup)
guide for more info!
## Using ClearML Data
@@ -46,7 +46,7 @@ ClearML Data supports two interfaces:
- `clearml-data` - A CLI utility for creating, uploading, and managing datasets. See [CLI](clearml_data_cli.md) for a reference of `clearml-data` commands.
- `clearml.Dataset` - A Python interface for creating, retrieving, managing, and using datasets. See [SDK](clearml_data_sdk.md) for an overview of the basic methods of the `Dataset` module.
For an overview of recommendations for ClearML Data workflows and practices, see [Best Practices](best_practices.md).
For an overview of recommendations for ClearML Data workflows and practices, see [Best Practices](../best_practices/data_best_practices.md).
## Dataset Version States
The following table displays the possible states for a dataset version.


@@ -7,7 +7,7 @@ tasks for you, and an extensive set of powerful features and functionality you c
and other workflows.
:::tip Installation
For installation instructions, see [Getting Started](../getting_started/ds/ds_first_steps.md#install-clearml).
For installation instructions, see [ClearML Setup](../clearml_sdk/clearml_sdk_setup#install-clearml).
:::
The ClearML Python Package collects the scripts' entire execution information, including:


@@ -1,7 +1,9 @@
---
title: First Steps
title: ClearML Python Package
---
This is a step-by-step guide for installing the `clearml` Python package and connecting it to the ClearML Server. Once done,
you can integrate `clearml` into your code.
## Install ClearML
@@ -44,8 +46,8 @@ pip install clearml
CLEARML_CONFIG_FILE = MyOtherClearML.conf
```
For more information about running experiments inside Docker containers, see [ClearML Agent Deployment](../../clearml_agent/clearml_agent_deployment.md)
and [ClearML Agent Reference](../../clearml_agent/clearml_agent_ref.md).
For more information about running tasks inside Docker containers, see [ClearML Agent Deployment](../clearml_agent/clearml_agent_deployment_bare_metal.md)
and [ClearML Agent Reference](../clearml_agent/clearml_agent_ref.md).
</Collapsible>
@@ -83,7 +85,7 @@ pip install clearml
CLEARML setup completed successfully.
```
Now you can integrate ClearML into your code! Continue [here](#auto-log-experiment).
Now you can integrate ClearML into your code! Continue [here](../getting_started/auto_log_exp.md).
### Jupyter Notebook
To use ClearML with Jupyter Notebook, you need to configure ClearML Server access credentials for your notebook.
@@ -94,49 +96,3 @@ To use ClearML with Jupyter Notebook, you need to configure ClearML Server acces
1. Add these commands to your notebook
Now you can use ClearML in your notebook!
## Auto-log Experiment
In ClearML, experiments are organized as [Tasks](../../fundamentals/task.md).
ClearML automatically logs your task and code, including outputs and parameters from popular ML frameworks,
once you integrate the ClearML [SDK](../../clearml_sdk/clearml_sdk.md) with your code. To control what ClearML automatically logs, see this [FAQ](../../faq.md#controlling_logging).
At the beginning of your code, import the `clearml` package:
```python
from clearml import Task
```
:::tip Full Automatic Logging
To ensure full automatic logging, it is recommended to import the `clearml` package at the top of your entry script.
:::
Then initialize the Task object in your `main()` function, or the beginning of the script.
```python
task = Task.init(project_name='great project', task_name='best task')
```
If the project does not already exist, a new one is created automatically.
The console should display the following output:
```
ClearML Task: created new task id=1ca59ef1f86d44bd81cb517d529d9e5a
2021-07-25 13:59:09
ClearML results page: https://app.clear.ml/projects/4043a1657f374e9298649c6ba72ad233/experiments/1ca59ef1f86d44bd81cb517d529d9e5a/output/log
2021-07-25 13:59:16
```
**That's it!** You are done integrating ClearML with your code :)
Now, [command-line arguments](../../fundamentals/hyperparameters.md#tracking-hyperparameters), [console output](../../fundamentals/logger.md#types-of-logged-results) as well as Tensorboard and Matplotlib will automatically be logged in the UI under the created Task.
Sit back, relax, and watch your models converge :) or continue to see what else can be done with ClearML [here](ds_second_steps.md).
## YouTube Playlist
Or watch the **Getting Started** playlist on ClearML's YouTube Channel!
[![Watch the video](https://img.youtube.com/vi/bjWwZAzDxTY/hqdefault.jpg)](https://www.youtube.com/watch?v=bjWwZAzDxTY&list=PLMdIlCuMqSTnoC45ME5_JnsJX0zWqDdlO&index=2)


@@ -2,16 +2,8 @@
title: Hyperparameter Optimization
---
## What is Hyperparameter Optimization?
Hyperparameters are variables that directly control the behaviors of training algorithms, and have a significant effect on
the performance of the resulting machine learning models. Finding the hyperparameter values that yield the best
performing models can be complicated. Manually adjusting hyperparameters over the course of many training trials can be
slow and tedious. Luckily, you can automate and boost hyperparameter optimization (HPO) with ClearML's
[**`HyperParameterOptimizer`**](../references/sdk/hpo_optimization_hyperparameteroptimizer.md) class.
## ClearML's Hyperparameter Optimization
ClearML provides the `HyperParameterOptimizer` class, which takes care of the entire optimization process for users
You can automate and boost hyperparameter optimization (HPO) with ClearML's
[**`HyperParameterOptimizer`**](../references/sdk/hpo_optimization_hyperparameteroptimizer.md) class, which takes care of the entire optimization process
with a simple interface.
ClearML's approach to hyperparameter optimization is scalable, easy to set up and to manage, and it makes it easy to
@@ -57,11 +49,11 @@ optimization.
documentation.
* **BOHB** - [`automation.hpbandster.OptimizerBOHB`](../references/sdk/hpo_hpbandster_bandster_optimizerbohb.md). BOHB performs robust and efficient hyperparameter optimization
at scale by combining the speed of Hyperband searches with the guidance and guarantees of convergence of Bayesian Optimization.
For more information about HpBandSter BOHB, see the [HpBandSter](https://automl.github.io/HpBandSter/build/html/index.html)
For more information about HpBandSter BOHB, see the [HpBandSter](https://automl.github.io/HpBandSter/build/html/index.html)
documentation and a [code example](../guides/frameworks/pytorch/notebooks/image/hyperparameter_search.md).
* **Random** uniform sampling of hyperparameters - [`automation.RandomSearch`](../references/sdk/hpo_optimization_randomsearch.md).
* **Full grid** sampling strategy of every hyperparameter combination - [`automation.GridSearch`](../references/sdk/hpo_optimization_gridsearch.md).
* **Custom** - [`automation.optimization.SearchStrategy`](https://github.com/clearml/clearml/blob/master/clearml/automation/optimization.py#L268) - Use a custom class and inherit from the ClearML automation base strategy class.
* **Custom** - [`automation.optimization.SearchStrategy`](https://github.com/clearml/clearml/blob/master/clearml/automation/optimization.py#L268) - Use a custom class and inherit from the ClearML automation base strategy class.
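As an illustration of the full-grid strategy above, every combination of the hyperparameter values is enumerated. A plain-Python sketch of the concept (not the ClearML API):

```python
from itertools import product

def grid(space: dict):
    """Yield every combination of the given hyperparameter values."""
    names = list(space)
    for values in product(*(space[n] for n in names)):
        yield dict(zip(names, values))

# 2 learning rates x 2 batch sizes -> 4 combinations
combos = list(grid({"lr": [0.01, 0.1], "batch_size": [32, 64]}))
```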
## Defining a Hyperparameter Optimization Search Example
@@ -137,9 +129,9 @@ optimization.
## Optimizer Execution Options
The `HyperParameterOptimizer` provides options to launch the optimization tasks locally or through a ClearML [queue](agents_and_queues.md#what-is-a-queue).
The `HyperParameterOptimizer` provides options to launch the optimization tasks locally or through a ClearML [queue](../fundamentals/agents_and_queues.md#what-is-a-queue).
Start a `HyperParameterOptimizer` instance using either [`HyperParameterOptimizer.start()`](../references/sdk/hpo_optimization_hyperparameteroptimizer.md#start)
or [`HyperParameterOptimizer.start_locally()`](../references/sdk/hpo_optimization_hyperparameteroptimizer.md#start_locally).
or [`HyperParameterOptimizer.start_locally()`](references/sdk/hpo_optimization_hyperparameteroptimizer.md#start_locally).
Both methods run the optimizer controller locally. `start()` launches the base task clones through a queue
specified when instantiating the controller, while `start_locally()` runs the tasks locally.
@@ -156,17 +148,3 @@ Check out the [Hyperparameter Optimization tutorial](../guides/optimization/hype
## SDK Reference
For detailed information, see the complete [HyperParameterOptimizer SDK reference page](../references/sdk/hpo_optimization_hyperparameteroptimizer.md).
## CLI
ClearML also provides `clearml-param-search`, a CLI utility for managing the hyperparameter optimization process. See
[ClearML Param Search](../apps/clearml_param_search.md) for more information.
## UI Application
:::info Pro Plan Offering
The ClearML HPO App is available under the ClearML Pro plan.
:::
ClearML provides the [Hyperparameter Optimization GUI application](../webapp/applications/apps_hpo.md) for launching and
managing the hyperparameter optimization process.


@@ -13,7 +13,7 @@ The following page goes over how to set up and upgrade `clearml-serving`.
## Initial Setup
1. Set up your [ClearML Server](../deploying_clearml/clearml_server.md) or use the
[free hosted service](https://app.clear.ml)
1. Connect `clearml` SDK to the server, see instructions [here](../getting_started/ds/ds_first_steps.md#install-clearml)
1. Connect `clearml` SDK to the server, see instructions [here](../clearml_sdk/clearml_sdk_setup#install-clearml)
1. Install clearml-serving CLI:


@@ -71,7 +71,7 @@ execute the tasks in the GPU queue.
#### Docker
Every task a cloud instance pulls will be run inside a docker container. When setting up an autoscaler app instance,
you can specify a default container to run the tasks inside. If the task has its own container configured, it will
override the autoscalers default docker image (see [Base Docker Image](../clearml_agent/clearml_agent_docker.md#base-docker-container)).
override the autoscaler's default docker image (see [Base Container](../getting_started/clearml_agent_base_docker.md#base-container)).
#### Git Configuration
If your code is saved in a private repository, you can add your Git credentials so the ClearML Agents running on your

docs/custom_apps.md

@@ -0,0 +1,24 @@
---
title: Custom Applications
---
:::info Enterprise Feature
Custom applications are available under the ClearML Enterprise plan.
:::
ClearML supports creating your own GUI applications for deploying GenAI apps into your Enterprise environment.
Instantly spin up apps with customized dashboards for internal customers, enabling seamless model testing, interactive
demos, automated workflow and more.
## Why Use Custom Applications?
Custom Applications provide:
* Instant Deployment: Launch interactive applications directly within your Enterprise environment
* Tailored UI: Customize forms and dashboards for monitoring processes
* Automated Execution: Run AI workflows with structured inputs and repeatable processes
* Accessible: Enable non-technical users to interact with models through GUI interfaces
* Seamless Integration: Connect with ClearML's ecosystem for task tracking and visualization
See [Custom Application Setup](deploying_clearml/enterprise_deploy/app_custom.md) for instructions on creating and
deploying custom ClearML applications.


@@ -4,14 +4,14 @@ title: ClearML Server
## What is ClearML Server?
The ClearML Server is the backend service infrastructure for ClearML. It allows multiple users to collaborate and
manage their experiments by working seamlessly with the ClearML Python package and [ClearML Agent](../clearml_agent.md).
manage their tasks by working seamlessly with the ClearML Python package and [ClearML Agent](../clearml_agent.md).
ClearML Server is composed of the following:
* Web server including the [ClearML Web UI](../webapp/webapp_overview.md), which is the user interface for tracking, comparing, and managing experiments.
* Web server including the [ClearML Web UI](../webapp/webapp_overview.md), which is the user interface for tracking, comparing, and managing tasks.
* API server which is a RESTful API for:
* Documenting and logging experiments, including information, statistics, and results.
* Querying experiments history, logs, and results.
* Documenting and logging tasks, including information, statistics, and results.
* Querying task history, logs, and results.
* File server which stores media and models making them easily accessible using the ClearML Web UI.
@@ -23,9 +23,9 @@ The ClearML Web UI is the ClearML user interface and is part of ClearML Server.
Use the ClearML Web UI to:
* Track experiments
* Compare experiments
* Manage experiments
* Track tasks
* Compare tasks
* Manage tasks
For detailed information about the ClearML Web UI, see [User Interface](../webapp/webapp_overview.md).
@@ -49,7 +49,7 @@ authentication, subdomains, and load balancers, and use any of its many configur
1. Optionally, [configure ClearML Server](clearml_server_config.md) for additional features, including subdomains and load balancers,
web login authentication, and the non-responsive task watchdog.
1. [Connect the ClearML SDK to the ClearML Server](../getting_started/ds/ds_first_steps.md)
1. [Connect the ClearML SDK to the ClearML Server](../clearml_sdk/clearml_sdk_setup)
## Updating


@@ -150,4 +150,4 @@ The following section contains a list of AMI Image IDs per-region for the latest
## Next Step
To keep track of your experiments and/or data, the `clearml` package needs to communicate with your server.
For instruction to connect the ClearML SDK to the server, see [Getting Started: First Steps](../getting_started/ds/ds_first_steps.md).
For instruction to connect the ClearML SDK to the server, see [ClearML Setup](../clearml_sdk/clearml_sdk_setup).


@@ -12,7 +12,7 @@ This page describes the ClearML Server [deployment](#clearml-server-deployment-c
* [Opening Elasticsearch, MongoDB, and Redis for External Access](#opening-elasticsearch-mongodb-and-redis-for-external-access)
* [Web login authentication](#web-login-authentication) - Create and manage users and passwords
* [Using hashed passwords](#using-hashed-passwords) - Option to use hashed passwords instead of plain-text passwords
* [Non-responsive Task watchdog](#non-responsive-task-watchdog) - For inactive experiments
* [Non-responsive Task watchdog](#non-responsive-task-watchdog) - For inactive tasks
* [Custom UI context menu actions](#custom-ui-context-menu-actions)
For all configuration options, see the [ClearML Configuration Reference](../configs/clearml_conf.md) page.
@@ -361,7 +361,7 @@ You can also use hashed passwords instead of plain-text passwords. To do that:
### Non-responsive Task Watchdog
The non-responsive experiment watchdog monitors experiments that were not updated for a specified time interval, and then
The non-responsive task watchdog monitors tasks that were not updated for a specified time interval, and then
the watchdog marks them as `aborted`. The non-responsive task watchdog is always active.
Modify the following settings for the watchdog:
@@ -391,7 +391,7 @@ Modify the following settings for the watchdog:
```
:::tip
If the `apiserver.conf` file does not exist, create your own in ClearML Server's `/opt/clearml/config` directory (or
If the `services.conf` file does not exist, create your own in ClearML Server's `/opt/clearml/config` directory (or
an alternate folder you configured), and input the modified configuration
:::
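The watchdog behavior described above amounts to: mark a task `aborted` when it has not been updated within the configured interval. A simplified sketch of that sweep (field names are illustrative, not the server's actual schema):

```python
import time

def sweep_non_responsive(tasks: list, threshold_sec: float, now: float = None) -> list:
    """Mark in-progress tasks as 'aborted' if not updated within threshold_sec.

    Hypothetical sketch of the server-side watchdog logic.
    """
    now = time.time() if now is None else now
    aborted = []
    for task in tasks:
        if task["status"] == "in_progress" and now - task["last_update"] > threshold_sec:
            task["status"] = "aborted"
            aborted.append(task["id"])
    return aborted

tasks = [
    {"id": "t1", "status": "in_progress", "last_update": 0},      # stale
    {"id": "t2", "status": "in_progress", "last_update": 9_000},  # recently updated
]
stale = sweep_non_responsive(tasks, threshold_sec=7_200, now=10_000)
```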
@@ -464,8 +464,8 @@ an alternate folder you configured), and input the modified configuration
:::
The action will appear in the context menu for the object type in which it was specified:
* Task, model, dataview - Right-click an object in the [experiments](../webapp/webapp_exp_table.md), [models](../webapp/webapp_model_table.md),
and [dataviews](../hyperdatasets/webapp/webapp_dataviews.md) tables respectively. Alternatively, click the object to
* Task, model, dataview - Right-click an object in the [task](../webapp/webapp_exp_table.md), [model](../webapp/webapp_model_table.md),
and [dataview](../hyperdatasets/webapp/webapp_dataviews.md) tables respectively. Alternatively, click the object to
open its info tab, then click the menu button <img src="/docs/latest/icons/ico-bars-menu.svg" className="icon size-md space-sm" />
to access the context menu.
* Project - In the project page > click the menu button <img src="/docs/latest/icons/ico-bars-menu.svg" className="icon size-md space-sm" />


@@ -7,7 +7,7 @@ provides custom images for each released version of ClearML Server. For a list o
[ClearML Server GCP Custom Image](#clearml-server-gcp-custom-image).
To keep track of your experiments and/or data, the `clearml` package needs to communicate with the server you have deployed.
For instruction to connect the ClearML SDK to the server, see [Getting Started: First Steps](../getting_started/ds/ds_first_steps.md).
For instruction to connect the ClearML SDK to the server, see [ClearML Setup](../clearml_sdk/clearml_sdk_setup).
:::info
In order for `clearml` to work with a ClearML Server on GCP, set `CLEARML_API_DEFAULT_REQ_METHOD=PUT` or
@@ -155,4 +155,4 @@ The following section contains a list of Custom Image URLs (exported in differen
## Next Step
To keep track of your experiments and/or data, the `clearml` package needs to communicate with your server.
For instruction to connect the ClearML SDK to the server, see [Getting Started: First Steps](../getting_started/ds/ds_first_steps.md).
For instruction to connect the ClearML SDK to the server, see [ClearML Setup](../clearml_sdk/clearml_sdk_setup).


@@ -32,4 +32,4 @@ instructions in the [Security](clearml_server_security.md) page.
## Next Step
To keep track of your experiments and/or data, the `clearml` package needs to communicate with your server.
For instruction to connect the ClearML SDK to the server, see [Getting Started: First Steps](../getting_started/ds/ds_first_steps.md).
For instruction to connect the ClearML SDK to the server, see [ClearML Setup](../clearml_sdk/clearml_sdk_setup).


@@ -227,4 +227,4 @@ If needed, restore data and configuration by doing the following:
## Next Step
To keep track of your experiments and/or data, the `clearml` package needs to communicate with your server.
For instruction to connect the ClearML SDK to the server, see [Getting Started: First Steps](../getting_started/ds/ds_first_steps.md).
For instruction to connect the ClearML SDK to the server, see [ClearML Setup](../clearml_sdk/clearml_sdk_setup).


@@ -89,4 +89,4 @@ After deploying ClearML Server, the services expose the following node ports:
## Next Step
To keep track of your experiments and/or data, the `clearml` package needs to communicate with your server.
For instruction to connect the ClearML SDK to the server, see [Getting Started: First Steps](../getting_started/ds/ds_first_steps.md).
For instruction to connect the ClearML SDK to the server, see [ClearML Setup](../clearml_sdk/clearml_sdk_setup).


@@ -0,0 +1,479 @@
---
title: Custom Applications
---
The following is a guide for creating and installing custom ClearML applications on ClearML on-premises Enterprise servers.
ClearML applications are Python programs that run as ClearML tasks, whose UI (input form and output dashboard) is
defined in an attached configuration file.
This guide will follow the `simple-app` application as an example. The application can be found on [GitHub](https://github.com/clearml/clearml-apps/tree/main/demo_apps/simple-app).
An application will generally consist of the following:
* Configuration file: File that describes the content of the application, such as:
* The task to run and from where to run it
* The structure of the input form for launching an application instance
* The information to display in the application instance's dashboard.
* Assets: Optional images and artifacts for the application, such as icons and HTML placeholders.
* Task: Python code that is run when the application is launched. Should be in a Git repository.
## Configuration File
The configuration file describes the application. The file is a HOCON file, typically named `<app-name>.app.conf`. It
contains the following sections:
* General: The root section, describing the application's general information such as name, ID, version, icon, and queue
* Task: Information about the task to execute, such as repository info and hyperparameters
* Wizard: Fields for the application instance launch form, and where to store the input provided by the user
* Dashboard: Information section displayed for the running application instances
### General
The `General` section is the root-level section of the configuration file, and contains the configuration options:
* `id` - A unique id for the application
* `name` - The name to display in the web application
* `version` - The version of the application implementation. Recommended to have three numbers and to bump up when updating applications, so that older running instances can still be displayed
* `provider` - The person/team/group that owns the application. This appears in the UI
* `description` - Short description of the application to be displayed in the ClearML Web UI
* `icon` (*Optional*) - Small image to display in the ClearML web UI as an icon for the application. Can be a public web URL or an image in the application's assets directory (described below)
* `no_info_html` (*Optional*) - HTML content to display as a placeholder for the dashboard when no instance is available. Can be a public web URL or a file in the application's assets directory (described below)
* `default_queue` - The queue to which application instances will be sent when a new instance is launched. This queue should have an appropriate agent servicing it. See details in the Custom Apps Agent section below.
* `badges` (*Optional*) - List of strings to display as badges/labels in the UI
* `resumable` - Boolean indicating whether a running application instance can be restarted if required. Default is `false`
* `category` (*Optional*) - A way to separate apps into different tabs in the ClearML web UI
* `featured` (*Optional*) - Value affecting the order of applications. Lower values are displayed first. Defaults to 500
#### Example
The root section in the simple application example:
```
id: "simple-app"
version: "1.0.0"
name: "Simple example application"
provider: "ClearML"
description: "A simple example of an application"
icon: "${ASSET:app-simple-app@2x.png}"
badges: []
details_page: "task"
no_info_html: "${ASSET:index.html}"
default_queue: "custom_apps_queue"
```
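The `${ASSET:...}` references above (for `icon` and `no_info_html`) point into the application's assets directory. Resolving them could look like this (a hypothetical sketch, not the actual packaging code):

```python
import re

def resolve_assets(value: str, assets_dir: str) -> str:
    """Expand ${ASSET:filename} references to paths under the assets directory."""
    return re.sub(r"\$\{ASSET:([^}]+)\}", lambda m: f"{assets_dir}/{m.group(1)}", value)

icon = resolve_assets("${ASSET:app-simple-app@2x.png}", "simple-app/assets")
```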
### Task
The `task` section describes the task to run, containing the following fields:
* `script` - Contains information about what task code to run:
* `repository` - The Git repository. Note that credentials must be described in the Custom Apps Agent's configuration. See details below.
* `branch` - The branch to use
* `entry_point` - The Python file to run
* `working_dir` - The directory to run it from
* `hyperparams` (*Optional*) - A list of the task's hyperparameters used by the application, with their default values. There is no need to specify all the parameters here, but doing so summarizes the parameters targeted by the wizard entries described below, and lets you specify default values for optional parameters appearing in the wizard.
#### Example
The `task` section in the simple application example:
```
task {
script {
repository: "https://bitbucket.org/seematics/clearml_apps.git"
entry_point: "main.py"
working_dir: "demo_apps/simple-app"
branch: "master"
}
hyperparams {
General {
a_number: 30.0
a_string: "testing 1, 2, 3"
a_boolean: False
a_project_id: ""
},
}
}
```
### Wizard
The `wizard` section defines the entries to display in the application instance's UI launch form. Each entry may contain the following fields:
* `name` - Field name
* `title` - Title to display in the wizard above the field
* `info` - Optional information hint to the user
* `type` - Can be one of the following:
* Basic types:
* `string`
* `integer`
* `float`
* `dropdown`
* `checkbox`
* `multiline_text`
* Complex types:
* `group` - Fields grouped together in a joint section. Fields of the group are defined within a list called
`item_template`
* `list` - A field or group of fields that can be inserted more than once. Target should be specified for the entire
list. Fields of the list are defined within a list called `item_template`
* `required` - Boolean indicating whether the user must fill in the field. Default is `false`
* `default` - Default value for the field
* `placeholder` - Text to show in the field before typing
* `collapsible` - Boolean indicating whether the group can be collapsed. Default is `false`
* `collapsibleTitleTemplate` - Optional title for collapsible fields. You can use `${field name}` to reference a field.
Useful for lists.
* `conditional` - Allows setting a condition for displaying a field. Specify a list of entries, each containing
the name of a field that appears earlier and its expected value. The field is displayed only if the referenced
previous fields were filled with the matching values. See the example below.
* `default_conditional_on` - Allows setting a field whose default value depends on the value of a previous field in the wizard.
Specify the `name` of the previous field and a `value` dictionary, in which each key is a potential value of the previous field and each value is the resulting default for this field.
* `choices` - For dropdowns. Can be either an array of hard-coded options, for example `["Option 1","Option 2"]`, or a ClearML object (such as a task, project, or queue) to choose from. The following should be specified:
* `source` - The source object. One of following:
* `project`
* `task`
* `model`
* `queue`
* `dataset_version`
* `display_field` - The field of the source object to display in the list. Usually “name”
* `value_field` - The field of the source object to use for configuring the app instance. Usually “id”
* `filter` - Allows to limit the choices list by setting a filter on one or more of the objects fields. See Project Selection example below
* `target` - Where in the application instances task the values will be set. Contains the following:
* `field` - Either `configuration` or `hyperparams`
* `section` - For hyperparams - the section within the field
* `name` - Key in which to store the value
* `format` - The format of the value to store. `str` by default. Use `json` for lists.
* `item_template` - List of items for `group` or `list` fields.
#### Example
The example is based on the `simple-app` application `wizard` section:
* Wizard Section:
```
wizard {
entries: [
]
}
```
* Boolean Field: A simple boolean field stored in the `General` hyperparameters section:
```
{
name: boolean_field
title: A boolean choice
default: false
type: checkbox
required: false
target {
field: hyperparams
section: General
name: a_bool
}
}
```
This will look like this:
![Bool choice](../../img/app_bool_choice.png#light-mode-only)
![Bool choice](../../img/app_bool_choice_dark.png#dark-mode-only)
* Conditional String Field: A string field presented only if the boolean field was checked:
```
{
name: string_field
title: A String
info: "Select a string to be passed to the application"
type: string
placeholder: "a string..."
conditional: {
entries: [
{
name: boolean_field
value: True
}
]
}
target {
field: hyperparams
section: General
name: a_string
}
}
```
This will look like this:
![Conditional string field](../../img/app_cond_str.png#light-mode-only)
![Conditional string field](../../img/app_cond_str_dark.png#dark-mode-only)
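* Conditional Default: A field whose default value depends on a previous field, using `default_conditional_on` (a sketch; field names are illustrative, not from `simple-app`):
  ```
  {
    name: instance_type
    title: Instance Type
    type: string
    default_conditional_on {
      name: cloud_provider
      value {
        aws: "m5.xlarge"
        gcp: "n1-standard-4"
      }
    }
    target {
      field: hyperparams
      section: General
      name: instance_type
    }
  }
  ```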
* Project Selection: A choices field for project selection, containing all projects whose names do not begin with `example`:
```
{
name: a_project_field
title: Choose a Project
info: "The app will count the tasks in this project"
type: dropdown
required: true
autocomplete: true
choices {
source: project
value_field: id
display_field: name
filter {
fields {
name: "^(?i)(?!example).*$"
}
}
}
target {
field: hyperparams
section: General
name: a_project_id
}
}
```
This will look like this:
![Project selection](../../img/app_proj_selection.png#light-mode-only)
![Project selection](../../img/app_proj_selection_dark.png#dark-mode-only)
* Group: Group with single field option:
```
{
type: group
name: more_options_group
title: More options
collapsible: true
item_template: [
{
name: a_text_field
title: Some Text
info: "Contains some text"
type: multiline_text
required: false
target: {
field: configuration
name: text_blob
}
}
]
}
```
This will look like this:
![Group with single field](../../img/app_group.png#light-mode-only)
![Group with single field](../../img/app_group_dark.png#dark-mode-only)
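* List Field with JSON Target: A repeatable list whose collected values are stored as a JSON list, per the `format` entry described above (a sketch; field names are illustrative, not from `simple-app`):
  ```
  {
    type: list
    name: extra_args
    title: Extra Arguments
    item_template: [
      {
        name: arg_value
        title: Argument
        type: string
      }
    ]
    target {
      field: hyperparams
      section: General
      name: extra_args
      format: json
    }
  }
  ```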
### Dashboard
The Dashboard section of the configuration file describes the fields that will appear in the instance's dashboard display.
The dashboard elements are organized into lines.
The section contains the following information:
* `lines` - The array of line elements, each containing:
* `style` - CSS definitions for the line, e.g. setting the line height
* `contents` - An array of dashboard elements to display in a given line. Each element may have several fields:
* `title` - Text to display at the top of the field
* `type` - One of the following:
  * `scalar-histogram`
  * `plot`
  * `debug-images`
  * `log`
  * `scalar`
  * `hyperparameter`
  * `configuration`
  * `html`
* `text` - For HTML. You can refer to task elements such as hyperparameters by using `${hyperparams.<section>.<parameter name>.value}`
* `metric` - For plot, scalar-histogram, debug-images, scalar - Name of the metric
* `variant` - For plot, scalar-histogram, debug-images, scalar - List of variants to display
* `key` - For histograms, one of the following: `iter`, `timestamp`, or `iso_time`
* `hide_legend` - Whether to hide the legend
#### Example
The example is based on the `simple-app` application `Dashboard` section:
* Dashboard Section
```
dashboard {
lines: [
]
}
```
* HTML Elements: A header line with two HTML elements based on the user's input:
```
{
style {
height: "initial"
}
contents: [
{
title: "HTML box with the string selected by the user"
type: html
text: "<h2>The string is ${hyperparams.General.a_string.value}</h2>"
},
{
title: "HTML box with the count of tasks"
type: html
text: "<h2>Project ${hyperparams.General.project_name.value} contains ${hyperparams.General.tasks_count.value} tasks</h2>"
}
]
}
```
This will look like this:
![HTML elements](../../img/app_html_elements.png#light-mode-only)
![HTML elements](../../img/app_html_elements_dark.png#dark-mode-only)
* Plot
```
{
contents: [
{
title: "A random plot"
type: plot
metric: "Plots"
variant: "plot"
}
]
}
```
This will look like this:
![Plot](../../img/app_plot.png#light-mode-only)
![Plot](../../img/app_plot_dark.png#dark-mode-only)
* Log
```
{
contents: [
{
title: "Logs"
type: log
}
]
}
```
This will look like this:
![Log](../../img/app_log.png#light-mode-only)
![Log](../../img/app_log_dark.png#dark-mode-only)
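* Scalar Histogram: A time series of reported scalar values, keyed by iteration (a sketch; metric and variant names are illustrative):
  ```
  {
    contents: [
      {
        title: "Task count over time"
        type: scalar-histogram
        metric: "Counts"
        variant: ["tasks"]
        key: iter
      }
    ]
  }
  ```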
### Assets
Assets are optional elements used by the application configuration to customize how the application is displayed in
the ClearML web UI. They typically contain icons, empty-state HTML, and any other required objects. Assets are stored in
a directory called `assets`.
To access assets from the application configuration file, use `${ASSET:<asset file name>}`. For example:
```
icon: "${ASSET:app-simple-app@2x.png}"
```
### Python Code
The code of the task that handles the application logic must be stored in a Git repository.
It is referenced by the script entry in the configuration file. For example:
```
script {
repository: "https://bitbucket.org/seematics/clearml_apps.git"
entry_point: "main.py"
working_dir: "demo_apps/simple-app"
branch: "master"
}
```
The task is run by a [Custom Applications Agent](#custom-apps-agent) within a Docker container. Any packages used should be
listed in a `requirements.txt` file in the working directory.
The task can read input from the configuration and from the `hyperparams` section, as defined in the application's
configuration file. It is the task's responsibility to update any element displayed in the dashboard.
## Deploying Custom Applications
### Custom Apps Agent
Custom applications require a separate agent from the one serving the ClearML built-in applications, since their code is
downloaded from a different Git repository.
To define a custom-apps agent, add the following to the `docker-compose.yml` or to the `docker-compose.override.yml`:
* In the `apiserver` service section, add the following lines to the environment to create a user for handling the custom apps:
```
- CLEARML__secure__credentials__custom_apps_agent__user_key="${CUSTOM_APPS_AGENT_USER_KEY}"
- CLEARML__secure__credentials__custom_apps_agent__user_secret="${CUSTOM_APPS_AGENT_USER_SECRET}"
- CLEARML__secure__credentials__custom_apps_agent__role="admin"
```
* Add the custom-apps-agent service:
```
custom-apps-agent:
container_name: custom-apps-agent
image: ${APPS_DAEMON_DOCKER_IMAGE}
restart: unless-stopped
privileged: true
environment:
- CLEARML_API_HOST=https://app.${SERVER_URL}/api
- CLEARML_FILES_HOST=https://files.${SERVER_URL}
- CLEARML_WEB_HOST=https://app.${SERVER_URL}
- CLEARML_API_ACCESS_KEY=${CUSTOM_APPS_AGENT_USER_KEY}
- CLEARML_API_SECRET_KEY=${CUSTOM_APPS_AGENT_USER_SECRET}
- CLEARML_AGENT_GIT_USER=${CUSTOM_APPS_AGENT_GIT_USER}
- CLEARML_AGENT_GIT_PASS=${CUSTOM_APPS_AGENT_GIT_PASSWORD}
- CLEARML_AGENT_DEFAULT_BASE_DOCKER=${APPS_WORKER_DOCKER_IMAGE}
- CLEARML_WORKER_ID=custom-apps-agent
- CLEARML_NO_DEFAULT_SERVER=true
- CLEARML_AGENT_DOCKER_HOST_MOUNT=/opt/allegro/data/agent/custom-app-agent:/root/.clearml
- CLEARML_AGENT_DAEMON_OPTIONS=--foreground --create-queue --use-owner-token --child-report-tags application --services-mode=${APPS_AGENT_INSTANCES:?err}
- CLEARML_AGENT_QUEUES=custom_apps_queue
- CLEARML_AGENT_NO_UPDATE=1
- CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/root/venv/bin/python3
# Disable Vault so that the apps will be downloaded with git credentials provided above, and not take any user's git credentials from the Vault.
- CLEARML_AGENT_EXTRA_DOCKER_ARGS=-e CLEARML_AGENT_DISABLE_VAULT_SUPPORT=1
- CLEARML_AGENT_SERVICES_DOCKER_RESTART=on-failure;application.resumable=True
- CLEARML_AGENT_DISABLE_SSH_MOUNT=1
- CLEARML_AGENT__AGENT__DOCKER_CONTAINER_NAME_FORMAT="custom-app-{task_id}-{rand_string:.8}"
- CLEARML_AGENT_EXTRA_DOCKER_LABELS="allegro-type=application subtype=custom"
labels:
ai.allegro.devops.allegro-software-type: "custom-apps-agent"
networks:
- backend
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- /opt/allegro/data/agent/custom-app-agent:/root/.clearml
- /opt/allegro/data/agent/custom-app-agent-v2/tmp:/tmp
depends_on:
- apiserver
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "3"
```
* Make sure to define the following variables in the `constants.env` or `runtime_created.env` configuration files:
* `CUSTOM_APPS_AGENT_USER_KEY` - A unique key for the user - any random string can be used
* `CUSTOM_APPS_AGENT_USER_SECRET` - A unique secret for the user - random UUID
* `CUSTOM_APPS_AGENT_GIT_USER` - The user for the Git repository
* `CUSTOM_APPS_AGENT_GIT_PASSWORD` - The password/app-password/token for the Git repository
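For example, the key and secret can be generated with standard tools (a sketch; per the descriptions above, any unique random string works for the key and a random UUID for the secret):

```shell
# Random 32-character hex key (any unique string works)
CUSTOM_APPS_AGENT_USER_KEY=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')
# Random UUID secret
CUSTOM_APPS_AGENT_USER_SECRET=$(cat /proc/sys/kernel/random/uuid)
echo "CUSTOM_APPS_AGENT_USER_KEY=${CUSTOM_APPS_AGENT_USER_KEY}"
echo "CUSTOM_APPS_AGENT_USER_SECRET=${CUSTOM_APPS_AGENT_USER_SECRET}"
```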
### Deploying Apps
#### Packaging an App
Create a zip file with the configuration, and with the assets, if applicable.
```
zip -r simple-app.zip simple-app.app.conf assets/
```
#### Installing an App
Run the `upload_apps.py` script to upload the applications. You will need to provide credentials for an admin user in the system:
```
upload_apps.py --host <apiserver url> --user <admin_user_key> --password <admin_user_secret> --files simple-app.zip
```
* `<apiserver url>` can be something like `https://api.my-server.allegro.ai` or `http://localhost:8008` if running on the server.
* `--user` and `--password` are key/secret credentials of any ClearML admin user. These can be generated in the ClearML web UI.
#### Removing an App
Applications can be uninstalled by running the `manage_apps.py` script as follows:
```
manage_apps.py delete --host <apiserver url> --user <admin_user_key> --password <admin_user_secret> -app <application id>
```

---
title: Installing External Applications Server
---
ClearML supports applications, which are extensions that allow additional capabilities, such as cloud auto-scaling,
Hyperparameter Optimizations, etc. For more information, see [ClearML Applications](../../webapp/applications/apps_overview.md).
Applications run inside Docker containers, which can either reside on the ClearML Server side, or on an external server.
The `clearml-apps-agent` polls an internal applications queue, and spawns additional Docker containers for application
instances that are launched using the ClearML web UI.
This document provides a short guide on how to configure an external applications server.
## Requirements
* A server, as described in [Server Requirements](#server-requirements)
* `docker-compose.yml` file provided by ClearML
* `constants.env` - Environment file with required credentials
* Credentials to access ClearML's enterprise DockerHub registry
### Server Requirements
* Operating system: Linux-based
* CPU: Since applications do not produce a high CPU load, we recommend 2-4 virtual CPUs, assuming around 10 concurrent
applications are required
* Memory: Around 1 GiB of RAM is required for each concurrent application instance
* Storage: About 100 GB of storage is recommended for the system volume, with an additional 100 GB of storage for
application caching. In AWS, `m6a.xlarge` can be used for running up to 10 applications in parallel.
## Installation
:::note
Installing an external server requires removing the applications agent from the ClearML Enterprise Server. This
is done by ClearML in hosted environments, or by removing the `apps-agent` service from the `docker-compose` override
file in VPC and on-premises installations. For K8s environments, please consult the ClearML team.
:::
1. Install Docker. See [Docker documentation](https://docs.docker.com/engine/install/ubuntu/)
1. Copy the `docker-compose.yml` and `constants.env` files to `/opt/allegro`. The
   `constants.env` file should contain the following definitions:
* `APISERVER_URL_FOR_EXTERNAL_WORKERS` - URL of the ClearML API server
* `WEBSERVER_URL_FOR_EXTERNAL_WORKERS` - URL of the ClearML WebApp
* `FILESERVER_URL_FOR_EXTERNAL_WORKERS` - URL of the ClearML files server
* `APPS_AGENT_USER_KEY` - Provided by ClearML
* `APPS_AGENT_USER_SECRET` - Provided by ClearML
* `APPS_AGENT_GIT_USER` - Provided by ClearML (required up to ClearML Server 1.8)
* `APPS_AGENT_GIT_PASSWORD` - Provided by ClearML (required up to ClearML Server 1.8)
* `APPS_WORKER_DOCKER_IMAGE` - Provided by ClearML (required up to ClearML Server 1.8)
* `APPS_DAEMON_DOCKER_IMAGE` - Provided by ClearML
1. Log in to the Docker registry:
```
sudo docker login --username allegroaienterprise
```
1. Pull the container:
```
docker compose --env-file constants.env pull
```
1. Start the service:
```
docker compose --env-file constants.env up -d
```
## Clearing Stopped Containers
Application containers that are stopped are not automatically deleted. Therefore, it is recommended to
periodically delete stopped containers. This can be done by adding the following to the cron file:
```
0 0 * * * root docker container prune --force --filter "until=96h" --filter "label=allegro-type=application"
```
## Monitoring
We recommend monitoring the following:
* Available memory
* CPU usage
* Remaining Storage
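As a quick manual spot check (an illustrative sketch; production setups should use a proper monitoring stack), standard Linux commands cover these metrics:

```shell
# Available memory in MiB
free -m | awk '/^Mem:/ {print "available_mb=" $7}'
# CPU load averages (1, 5, 15 minutes)
uptime
# Remaining storage (adjust the path to your data volume)
df -h /
```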
For more information contact ClearML's support team.

---
title: Application Installation on On-Prem and VPC Servers
---
ClearML Applications are like plugins that allow you to manage ML workloads and automatically run recurring workflows
without any coding. Applications are installed on top of the ClearML Server.
## Requirements
To run applications, you will need the following:
* RAM: Make sure you have at least 400 MB of RAM per application instance.
* Applications Service: Make sure that the applications agent service is up and running on your server:
  * If you are using a docker-compose solution, make sure that the `clearml-apps-agent` service is running.
  * If you are using a Kubernetes cluster, check for the `clearml-clearml-enterprise-apps` component.
* Installation Files: Each application has its installation zip file. Make sure you have the relevant files for the
applications you wish to install.
* Installation Script - See below
## Air-Gapped Environments
For Air-Gapped installations you need to copy docker images to the local registry and then update the application
configuration files to use this repository. This can be achieved by using the `convert_image_registry.py` script with
the `--repo` flag. For example:
```
python convert_image_registry.py \
--apps-dir /path/to/apps/ \
--repo local_registry/clearml-apps
```
The script will change the application zip files to point to the new registry, and will output the list of containers
that need to be copied to the local registry. For example:
```
make sure allegroai/clearml-apps:hpo-1.10.0-1062 was added to local_registry/clearml-apps
```
## Installing on ClearML Server
The `upload_apps.py` script handles uploading the app packages to the ClearML Server. It requires Python 3.
To see the options, run:
```commandline
python3 upload_apps.py --help
```
### Credentials
The script requires a user and password (`USER_KEY`/`USER_SECRET` in the examples below). These can be taken from
the credentials of an admin user, which can be generated in the ClearML web application.
### Host
For the host, supply the `apiserver` address. If running locally on the server, you can use `localhost:8008`.
### Uploading a Single Application
```commandline
python3 upload_apps.py \
--host <APISERVER_URL> \
--user <USER_KEY> \
--password <USER_SECRET> \
--files "YOUR_APP.zip"
```
### Uploading Multiple Applications
If you wish to install more than one app, you can use the `--dir` argument instead of `--files`:
```commandline
python3 upload_apps.py \
--host <APISERVER_URL> \
--user <USER_KEY> \
--password <USER_SECRET> \
--dir "DIRECTORY_CONTAINING_APPS_ZIP_FILES"
```

---
title: AI Application Gateway
---
:::important Enterprise Feature
This feature is available under the ClearML Enterprise plan.
:::
Services running through a cluster orchestrator such as Kubernetes or a cloud hyperscaler require meticulous configuration
to make them available, as these environments do not expose their networks to external users.
The ClearML AI Application Gateway facilitates setting up secure, authenticated access to jobs running on your compute
nodes from external networks.
Using the AI Application Gateway, services are allocated externally accessible, SSL-secured network routes that provide
access in adherence to ClearML RBAC privileges. The AI Application Gateway supports HTTP/S as well as raw TCP routing.
The following ClearML UI applications make use of the AI Application Gateway to provide authenticated HTTPS access to
their instances:
* GPUaaS
* [JupyterLab](../../webapp/applications/apps_jupyter_lab.md)
* [VScode](../../webapp/applications/apps_vscode.md)
* [SSH Session](../../webapp/applications/apps_ssh_session.md)
* UI Dev
* [Gradio launcher](../../webapp/applications/apps_gradio.md)
* [Streamlit launcher](../../webapp/applications/apps_streamlit.md)
* Deploy
* [vLLM Deployment](../../webapp/applications/apps_model_deployment.md)
* [Embedding Model Deployment](../../webapp/applications/apps_embed_model_deployment.md)
* [Llama.cpp Model Deployment](../../webapp/applications/apps_llama_deployment.md)
The AI Application Gateway is provided through an additional component to the ClearML Server deployment: The ClearML Task Traffic Router.
If your ClearML Deployment does not have the Task Traffic Router properly installed, these application instances may not be accessible.
#### Installation
The Task Traffic Router supports two deployment options:
* [Docker Compose](appgw_install_compose.md)
* [Kubernetes](appgw_install_k8s.md)
The deployment configuration specifies the external and internal address and port mappings for routing requests.

# Docker-Compose Deployment
## Requirements
* Linux OS (x86) machine
* Root access
* Credentials for the ClearML/allegroai docker repository
* A valid ClearML Server installation
## Host configurations
### Docker installation
Installing Docker and docker-compose might vary depending on the specific operating system you're using. Here is an example for Amazon Linux:
```
sudo dnf -y install docker
DOCKER_CONFIG="/usr/local/lib/docker"
sudo mkdir -p $DOCKER_CONFIG/cli-plugins
sudo curl -SL https://github.com/docker/compose/releases/download/v2.17.3/docker-compose-linux-x86_64 -o $DOCKER_CONFIG/cli-plugins/docker-compose
sudo chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
sudo systemctl enable docker
sudo systemctl start docker
sudo docker login
```
Use the ClearML/allegroai DockerHub credentials when prompted by `docker login`.
### Docker-compose file
This is an example of the docker-compose file you will need:
```
version: '3.5'
services:
task_traffic_webserver:
image: allegroai/task-traffic-router-webserver:${TASK-TRAFFIC-ROUTER-WEBSERVER-TAG}
ports:
- "80:8080"
restart: unless-stopped
container_name: task_traffic_webserver
volumes:
- ./task_traffic_router/config/nginx:/etc/nginx/conf.d:ro
- ./task_traffic_router/config/lua:/usr/local/openresty/nginx/lua:ro
task_traffic_router:
image: allegroai/task-traffic-router:${TASK-TRAFFIC-ROUTER-TAG}
restart: unless-stopped
container_name: task_traffic_router
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- ./task_traffic_router/config/nginx:/etc/nginx/conf.d:rw
- ./task_traffic_router/config/lua:/usr/local/openresty/nginx/lua:rw
environment:
- LOGGER_LEVEL=INFO
- CLEARML_API_HOST=${CLEARML_API_HOST:?err}
- CLEARML_API_ACCESS_KEY=${CLEARML_API_ACCESS_KEY:?err}
- CLEARML_API_SECRET_KEY=${CLEARML_API_SECRET_KEY:?err}
- ROUTER_URL=${ROUTER_URL:?err}
- ROUTER_NAME=${ROUTER_NAME:?err}
- AUTH_ENABLED=${AUTH_ENABLED:?err}
- SSL_VERIFY=${SSL_VERIFY:?err}
- AUTH_COOKIE_NAME=${AUTH_COOKIE_NAME:?err}
- AUTH_BASE64_JWKS_KEY=${AUTH_BASE64_JWKS_KEY:?err}
- LISTEN_QUEUE_NAME=${LISTEN_QUEUE_NAME}
- EXTRA_BASH_COMMAND=${EXTRA_BASH_COMMAND}
- TCP_ROUTER_ADDRESS=${TCP_ROUTER_ADDRESS}
- TCP_PORT_START=${TCP_PORT_START}
- TCP_PORT_END=${TCP_PORT_END}
```
Create a `runtime.env` file containing the following entries:
```
TASK-TRAFFIC-ROUTER-WEBSERVER-TAG=
TASK-TRAFFIC-ROUTER-TAG=
CLEARML_API_HOST=https://api.
CLEARML_API_ACCESS_KEY=
CLEARML_API_SECRET_KEY=
ROUTER_URL=
ROUTER_NAME=main-router
AUTH_ENABLED=true
SSL_VERIFY=true
AUTH_COOKIE_NAME=
AUTH_BASE64_JWKS_KEY=
LISTEN_QUEUE_NAME=
EXTRA_BASH_COMMAND=
TCP_ROUTER_ADDRESS=
TCP_PORT_START=
TCP_PORT_END=
```
Edit it according to the following guidelines:
* `CLEARML_API_HOST`: URL usually starting with `https://api.`
* `CLEARML_API_ACCESS_KEY`: ClearML server api key
* `CLEARML_API_SECRET_KEY`: ClearML server secret key
* `ROUTER_URL`: URL for this router that was previously configured in the load balancer starting with `https://`
* `ROUTER_NAME`: unique name for this router
* `AUTH_ENABLED`: enable or disable http calls authentication when the router is communicating with the ClearML server
* `SSL_VERIFY`: enable or disable SSL certificate validation when the router is communicating with the ClearML server
* `AUTH_COOKIE_NAME`: the cookie name used by the ClearML server to store the ClearML authentication cookie. This can usually be found in the `value_prefix` key starting with `allegro_token` in `envoy.yaml` file in the ClearML server installation (`/opt/allegro/config/envoy/envoy.yaml`) (see below)
* `AUTH_SECURE_ENABLED`: enable the Set-Cookie `secure` parameter
* `AUTH_BASE64_JWKS_KEY`: value from the `k` key in the `jwks.json` file in the ClearML server installation
* `LISTEN_QUEUE_NAME`: (optional) name of queue to check for tasks (if none, every task is checked)
* `EXTRA_BASH_COMMAND`: command to be launched before starting the router
* `TCP_ROUTER_ADDRESS`: router external address; can be an IP of the host machine or a load balancer hostname, depending on network configuration
* `TCP_PORT_START`: start port for the TCP Session feature
* `TCP_PORT_END`: end port for the TCP Session feature
Run the following command to start the router:
```
sudo docker compose --env-file runtime.env up -d
```
:::note How to find my JWKS key
The *JSON Web Key Set* (*JWKS*) is a set of keys containing the public keys used to verify any JSON Web Token (JWT).
In a docker-compose server installation, this can be found in the `CLEARML__secure__auth__token_secret` env var in the apiserver server component.
:::

# Kubernetes Deployment
This guide details the installation of the ClearML AI Application Gateway, specifically the ClearML Task Traffic Router component.
## Requirements
* Kubernetes cluster: `>= 1.21.0-0 < 1.32.0-0`
* Helm installed and configured
* Helm token to access allegroai helm-chart repo
* Credentials for allegroai docker repo
* A valid ClearML Server installation
## Optional for HTTPS
* A valid DNS entry for the new TTR instance
* A valid SSL certificate
## Helm
### Login
```
helm repo add allegroai-enterprise \
https://raw.githubusercontent.com/allegroai/clearml-enterprise-helm-charts/gh-pages \
--username <GITHUB_TOKEN> \
--password <GITHUB_TOKEN>
```
### Prepare values
Before installing the TTR, create a Helm override file named `task-traffic-router.values-override.yaml`:
```
imageCredentials:
password: "<DOCKERHUB_TOKEN>"
clearml:
apiServerKey: ""
apiServerSecret: ""
apiServerUrlReference: "https://api."
jwksKey: ""
authCookieName: ""
ingress:
enabled: true
hostName: "task-router.dev"
tcpSession:
routerAddress: ""
portRange:
start:
end:
```
Edit it according to the following guidelines:
* `clearml.apiServerUrlReference`: URL usually starting with `https://api.`
* `clearml.apiServerKey`: ClearML server api key
* `clearml.apiServerSecret`: ClearML server secret key
* `ingress.hostName`: URL for this router that was previously configured in the load balancer, starting with `https://`
* `clearml.sslVerify`: enable or disable SSL certificate validation on `apiserver` calls
* `clearml.authCookieName`: value from the `value_prefix` key starting with `allegro_token` in the `envoy.yaml` file in the ClearML server installation
* `clearml.jwksKey`: value from the `k` key in the `jwks.json` file in the ClearML server installation (see below)
* `tcpSession.routerAddress`: router external address; can be an IP of the host machine or a load balancer hostname, depending on the network configuration
* `tcpSession.portRange.start`: start port for the TCP Session feature
* `tcpSession.portRange.end`: end port for the TCP Session feature
:::note How to find my JWKS key
The *JSON Web Key Set* (*JWKS*) is a set of keys containing the public keys used to verify any JSON Web Token (JWT).
```
kubectl -n clearml get secret clearml-conf \
-o jsonpath='{.data.secure_auth_token_secret}' \
| base64 -d && echo
```
:::
The whole list of supported configurations is available with the following command:
```
helm show readme allegroai-enterprise/clearml-enterprise-task-traffic-router
```
### Install
To install the TTR component via Helm use the following command:
```
helm upgrade --install \
<RELEASE_NAME> \
-n <NAME_SPACE> \
allegroai-enterprise/clearml-enterprise-task-traffic-router \
--version <CURRENT CHART VERSION> \
-f task-traffic-router.values-override.yaml
```

---
title: Changing ClearML Artifacts Links
---
This guide describes how to update artifact references in the ClearML Enterprise server.
By default, artifacts are stored on the file server; however, external storage such as AWS S3, MinIO, or Google Cloud
Storage may be used to store artifacts. References to these artifacts may exist in the ClearML databases: MongoDB and ElasticSearch.
This procedure should be used if external storage is being migrated to a different location or URL.
:::important
This procedure does not deal with the actual migration of the data, only with changing the references in ClearML that
point to the data.
:::
## Preparation
### Version Confirmation
To change the links, use the `fix_fileserver_urls.py` script, located inside the `allegro-apiserver`
Docker container. The script is executed from within the `apiserver` container. Make sure the `apiserver` version
is 3.20 or higher.
### Backup
It is highly recommended to back up the ClearML MongoDB and ElasticSearch databases before running the script, as it
changes values in the databases in a way that cannot be undone.
## Fixing MongoDB links
1. Access the `apiserver` Docker container:
* In `docker-compose:`
```commandline
sudo docker exec -it allegro-apiserver /bin/bash
```
* In Kubernetes:
```commandline
kubectl exec -it -n clearml <clearml-apiserver-pod-name> -- bash
```
1. Navigate to the script location in the `upgrade` folder:
```commandline
cd /opt/seematics/apiserver/server/upgrade
```
1. Run the following command:
:::important
Before running the script, verify that this is indeed the correct version (`apiserver` v3.20 or higher,
or that the script provided by ClearML was copied into the container).
:::
```commandline
python3 fix_fileserver_urls.py \
--mongo-host mongodb://mongo:27017 \
--elastic-host elasticsearch:9200 \
--host-source "<old fileserver host and/or port, as in artifact links>" \
--host-target "<new fileserver host and/or port>" --datasets
```
:::note Notes
* If MongoDB or ElasticSearch services are accessed from the `apiserver` container using custom addresses, then
`--mongo-host` and `--elastic-host` arguments should be updated accordingly.
* If ElasticSearch is set up to require authentication then the following arguments should be used to pass the user
and password: `--elastic-user <es_user> --elastic-password <es_pass>`
:::
The script fixes the links in MongoDB, and outputs `cURL` commands for updating the links in ElasticSearch.
## Fixing the ElasticSearch Links
Copy the `cURL` commands printed by the script run in the previous stage, and run them one after the other. Make sure to
inspect that a "success" result was returned from each command. Depending on the amount of the data in the ElasticSearch,
running these commands may take some time.

---
title: Custom Billing Events
---
ClearML supports sending custom events to selected Kafka topics. Event sending is triggered by API calls and
is available only for companies with the `custom_events` setting enabled.
## Enabling Custom Events in ClearML Server
:::important Prerequisite
A customer Kafka instance for custom events must be installed and reachable from the `apiserver`.
:::
Set the following environment variables in the ClearML Enterprise Helm chart under `apiserver.extraEnv`:
* Enable custom events:
```
- name: CLEARML__services__custom_events__enabled
value: "true"
```
* Mount custom message template files into the `/mnt/custom_events/templates` folder in the `apiserver` container and point
  the `apiserver` to it:
```
- name: CLEARML__services__custom_events__template_folder
value: "/mnt/custom_events/templates"
```
* Configure the Kafka host for sending events:
```
- name: CLEARML__hosts__kafka__custom_events__host
value: "[<KAFKA host address:port>]"
```
* Configure Kafka security parameters. Below is an example for SASL plaintext security:
```
- name: CLEARML__SECURE__KAFKA__CUSTOM_EVENTS__security_protocol
value: "SASL_PLAINTEXT"
- name: CLEARML__SECURE__KAFKA__CUSTOM_EVENTS__sasl_mechanism
value: "SCRAM-SHA-512"
- name: CLEARML__SECURE__KAFKA__CUSTOM_EVENTS__sasl_plain_username
value: "<username>"
- name: CLEARML__SECURE__KAFKA__CUSTOM_EVENTS__sasl_plain_password
value: "<password>"
```
* Define Kafka topics for lifecycle and inventory messages:
```
- name: CLEARML__services__custom_events__channels__main__topics__service_instance_lifecycle
value: "lifecycle"
- name: CLEARML__services__custom_events__channels__main__topics__service_instance_inventory
value: "inventory"
```
* For the desired companies, set up the custom events properties required by the event message templates:
```
curl $APISERVER_URL/system.update_company_custom_events_config -H "Content-Type: application/json" -u $APISERVER_KEY:$APISERVER_SECRET -d'{
"company": "<company_id>",
"fields": {
"service_instance_id": "<value>",
"service_instance_name": "<value>",
"service_instance_customer_tenant_name": "<value>",
"service_instance_customer_space_name": "<value>",
"service_instance_customer_space_id": "<value>",
"parameters_connection_points": ["<value1>", "<value2>"]
}}'
```
## Sending Custom Events to the API Server
:::important Prerequisite
A dedicated custom-events Redis instance must be installed and reachable from all the custom events deployments.
:::
Environment lifecycle events are sent directly by the `apiserver`. Other event types are emitted by the following helm charts:
* `clearml-pods-monitor-exporter` - Monitors running pods and sends container lifecycle events. One instance should run per cluster, with a unique identifier (a UUID) required for the installation:
```
# -- Universal Unique string to identify Pods Monitor instances across worker clusters. It cannot be empty.
# Uniqueness is required across different cluster installations to preserve the reported data status.
podsMonitorUUID: "<Unique ID>"
# Interval
checkIntervalSeconds: 60
```
* `clearml-pods-inventory` - Periodically sends inventory events about running pods.
```
# Cron schedule - https://crontab.guru/
cronJob:
schedule: "@daily"
```
* `clearml-company-inventory` - Monitors ClearML companies and sends environment inventory events:
```
# Cron schedule - https://crontab.guru/
cronJob:
schedule: "@daily"
```
For every script chart, add the following configuration to enable Redis access and connection to the `apiserver`:
```
clearml:
apiServerUrlReference: "<APISERVER_URL>"
apiServerKey: "<APISERVER_KEY>"
apiServerSecret: "<APISERVER_SECRET>"
redisConnection:
host: "<REDIS_HOST>"
port: <REDIS_PORT>
password: "<REDIS_PWD>"
```
See all other available options to customize the `custom-events` charts by running:
```
helm show readme allegroai-enterprise/<CHART_NAME>
```
@@ -0,0 +1,115 @@
---
title: Deleting Tenants from ClearML
---
The following is a step-by-step guide for deleting tenants (i.e. companies, workspaces) from ClearML.
:::caution
Deleting a tenant is a destructive operation that cannot be undone.
* Make sure you have the data prior to deleting the tenant.
* Backing up the system before deleting is recommended.
:::
Tenant deletion is performed in MongoDB, Elasticsearch, and the fileserver.
The first two are done from within the `apiserver` container, and the last from within the `fileserver` container.
Any external artifacts (e.g. AWS S3, GCS, MinIO) must be removed manually.
## Deleting Tenants from MongoDB and Elasticsearch
1. Enter the `apiserver` in one of the following ways:
* In `docker-compose`:
```
sudo docker exec -it allegro-apiserver /bin/bash
```
* In Kubernetes:
```
kubectl -n <namespace> exec -it <apiserver pod name> -c clearml-apiserver -- /bin/bash
```
1. Set the ID and the name of the company (tenant) you wish to delete:
```
tenant_to_delete=<tenant-id>
company_name_to_delete="<company-name>"
```
1. Delete the company's data from MongoDB:
```
PYTHONPATH=../trains-server-repo python3 \
-m jobs.management.delete_company_data_from_mongo \
--id $tenant_to_delete \
   --name "$company_name_to_delete" \
--delete-user
```
:::note
This also deletes the admin users. Remove `--delete-user` to avoid this.
:::
1. Delete the company's data from Elasticsearch:
```
PYTHONPATH=../trains-server-repo python3 \
-m jobs.management.cleanup_deleted_companies \
--ids $tenant_to_delete --delete-company
```
1. Exit pod/container
## Deleting Tenants from the Fileserver
To remove a tenant's data from the fileserver, you can choose one of the following methods depending on your deployment setup:
* Option 1: Delete the tenant's data from within the fileserver container or pod.
* Option 2: Delete the tenant's data externally from the host system.
### Option 1 - From Within the Fileserver
1. Enter the `fileserver` in one of the following ways:
* In `docker-compose`:
```
sudo docker exec -it allegro-fileserver /bin/bash
```
* In Kubernetes:
```
kubectl -n <namespace> exec -it <fileserver pod name> -c clearml-fileserver -- /bin/bash
```
1. Run the following:
```
rm -rf /mnt/fileserver/<tenant-id>
```
1. Exit pod/container
### Option 2 - External Deletion
#### Docker Compose
Run the following:
```
rm -rf /opt/allegro/data/fileserver/<tenant-id>
```
#### Kubernetes
Run the following:
```
kubectl -n <namespace> exec -it <apiserver-pod-name> -c clearml-apiserver -- /bin/bash -c "PYTHONPATH=../trains-server-repo python3 -m jobs.management.delete_company_data_from_mongo --id <tenant-id> --delete-user"
kubectl -n <namespace> exec -it <apiserver-pod-name> -c clearml-apiserver -- /bin/bash -c "PYTHONPATH=../trains-server-repo python3 -m jobs.management.cleanup_deleted_companies --ids <tenant-id> --delete-company"
kubectl -n <namespace> exec -it <fileserver-pod-name> -c clearml-fileserver -- /bin/bash -c "rm -rf /mnt/fileserver/<tenant-id>"
```
@@ -0,0 +1,240 @@
---
title: Exporting and Importing ClearML Projects
---
When migrating from a ClearML Open Server to a ClearML Enterprise Server, you may need to transfer projects. This is done
using the `data_tool.py` script. This utility is available in the `apiserver` Docker image, and can be used for
exporting and importing ClearML project data for both open source and Enterprise versions.
This guide covers the following:
* Exporting data from Open Source and Enterprise servers
* Importing data into an Enterprise server
* Handling the artifacts stored on the file server.
:::note
Export instructions differ for ClearML open and Enterprise servers. Make sure you follow the guidelines that match your
server type.
:::
## Exporting Data
The export process is done by running the ***data_tool*** script that generates a zip file containing project and task
data. This file should then be copied to the server on which the import will run.
Note that artifacts stored in the ClearML ***file server*** should be copied manually if required (see [Handling Artifacts](#handling-artifacts)).
### Exporting Data from ClearML Open Servers
#### Preparation
* Make sure the `apiserver` is at least Open Source server version 1.12.0.
* Note that any `pending` or `running` tasks will not be exported. If you wish to export them, make sure to stop/dequeue
them before exporting.
#### Running the Data Tool
Execute the data tool within the `apiserver` container.
Open a bash session inside the `apiserver` container of the server:
* In docker-compose:
```commandline
sudo docker exec -it clearml-apiserver /bin/bash
```
* In Kubernetes:
```commandline
kubectl exec -it -n <clearml-namespace> <clearml-apiserver-pod-name> -- bash
```
#### Export Commands
**To export specific projects:**
```commandline
python3 -m apiserver.data_tool export --projects <project_id1> <project_id2> \
--statuses created stopped published failed completed --output <output-file-name>.zip
```
As a result, you should get a `<output-file-name>.zip` file that contains all the data from the specified projects and
their children.
**To export all the projects:**
```commandline
python3 -m apiserver.data_tool export \
--all \
--statuses created stopped published failed completed \
--output <output-file-name>.zip
```
#### Optional Parameters
* `--experiments <list of experiment IDs>` - If not specified then all experiments from the specified projects are exported
* `--statuses <list of task statuses>` - Export tasks of specific statuses. If the parameter
is omitted, only `published` tasks are exported
* `--no-events` - Do not export task events, i.e. logs and metrics (scalar, plots, debug samples).
Make sure to copy the generated zip file containing the exported data.
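To get the archive out of the container, `docker cp` or `kubectl cp` can be used. A minimal sketch, assuming the export was written to `/tmp/export.zip` inside the container (adjust names and paths to your deployment):

```commandline
# docker-compose deployment: copy the export from the apiserver container to the host
sudo docker cp clearml-apiserver:/tmp/export.zip ./export.zip

# Kubernetes deployment: copy the export from the apiserver pod to the local machine
kubectl cp -n <clearml-namespace> <clearml-apiserver-pod-name>:/tmp/export.zip ./export.zip
```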
### Exporting Data from ClearML Enterprise Servers
#### Preparation
* Make sure the `apiserver` is at least Enterprise Server version 3.18.0.
* Note that any `pending` or `running` tasks will not be exported. If you wish to export them, make sure to stop/dequeue
  them before exporting.
#### Running the Data Tool
Execute the data tool from within the `apiserver` docker container.
Open a bash session inside the `apiserver` container of the server:
* In `docker-compose`:
```commandline
sudo docker exec -it allegro-apiserver /bin/bash
```
* In Kubernetes:
```commandline
kubectl exec -it -n <clearml-namespace> <clearml-apiserver-pod-name> -- bash
```
#### Export Commands
**To export specific projects:**
```commandline
PYTHONPATH=/opt/seematics/apiserver/trains-server-repo python3 data_tool.py \
export \
--projects <project_id1> <project_id2> \
--statuses created stopped published failed completed \
--output <output-file-name>.zip
```
As a result, you should get a `<output-file-name>.zip` file that contains all the data from the specified projects and
their children.
**To export all the projects:**
```commandline
PYTHONPATH=/opt/seematics/apiserver/trains-server-repo python3 data_tool.py \
export \
--all \
--statuses created stopped published failed completed \
--output <output-file-name>.zip
```
#### Optional Parameters
* `--experiments <list of experiment IDs>` - If not specified then all experiments from the specified projects are exported
* `--statuses <list of task statuses>` - Can be used to allow exporting tasks of specific statuses. If the parameter is
omitted, only `published` tasks are exported.
* `--no-events` - Do not export task events, i.e. logs and metrics (scalar, plots, debug samples).
Make sure to copy the generated zip file containing the exported data.
## Importing Data
This section explains how to import the exported data into a ClearML Enterprise server.
### Preparation
* It is highly recommended to back up the ClearML databases before importing data, as import injects data into the
databases, and can't be undone.
* Make sure you are working with `apiserver` version 3.22.3 or higher.
* Make the zip file accessible from within the `apiserver` container by copying the exported data to the
`apiserver` container or to a folder on the host, which the `apiserver` is mounted to.
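For example, the archive can be copied into the container with `docker cp` or `kubectl cp`. A minimal sketch, assuming the exported file is named `export.zip` (adjust names and paths to your deployment):

```commandline
# docker-compose deployment: copy the exported zip into the apiserver container
sudo docker cp ./export.zip allegro-apiserver:/tmp/export.zip

# Kubernetes deployment: copy it into the apiserver pod
kubectl cp -n <clearml-namespace> ./export.zip <clearml-apiserver-pod-name>:/tmp/export.zip
```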
### Usage
The data tool should be executed from within the `apiserver` docker container.
1. Open a bash session inside the `apiserver` container of the server:
* In `docker-compose`:
```commandline
sudo docker exec -it allegro-apiserver /bin/bash
```
* In Kubernetes:
```commandline
kubectl exec -it -n <clearml-namespace> <clearml-apiserver-pod-name> -- bash
```
1. Run the data tool script in *import* mode:
```commandline
PYTHONPATH=/opt/seematics/apiserver/trains-server-repo python3 data_tool.py \
import \
<path to zip file> \
--company <company_id> \
--user <user_id>
```
* `company_id`- The default company ID used in the target deployment. Inside the `apiserver` container you can
usually get it from the environment variable `CLEARML__APISERVER__DEFAULT_COMPANY`.
If you do not specify the `--company` parameter then all the data will be imported as `Examples` (read-only)
* `user_id` - The ID of the user in the target deployment who will become the owner of the imported data
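If you do not know the target user's ID, it can usually be looked up through the API. A hedged sketch, assuming the `users.get_all` endpoint and valid API credentials (the `$APISERVER_*` variables here are illustrative):

```commandline
curl $APISERVER_URL/users.get_all \
  -H "Content-Type: application/json" \
  -u $APISERVER_KEY:$APISERVER_SECRET \
  -d '{"name":"<user name>"}'
```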
## Handling Artifacts
***Artifacts*** refers to any content that the ClearML server holds references to. This can include:
* Dataset or Hyper-Dataset frame URLs
* ClearML artifact URLs
* Model snapshots
* Debug samples
Artifacts may be stored in any external storage (e.g., AWS S3, minio, Google Cloud Storage) or in the ClearML file server.
* If the artifacts are **not** stored in the ClearML file server, they do not need to be moved during the export/import process,
as the URLs registered in ClearML entities pointing to these artifacts will not change.
* If the artifacts are stored in the ClearML file server, then the file server content must also be moved, and the URLs
in the ClearML databases must point to the new location. See instructions [below](#exporting-file-server-data-for-clearml-open-server).
### Exporting File Server Data for ClearML Open Server
Data in the file server is organized by project. For each project, all data referenced by entities in that project is
stored in a folder bearing the name of the project. This folder is located in:
```
/opt/clearml/data/fileserver/<project name>
```
The entire content of the project folders should be copied to the target server (see [Importing Fileserver Data](#importing-file-server-data)).
### Exporting File Server Data for ClearML Enterprise Server
Data in the file server is organized by tenant and project. For each project, all data referenced by entities in that
project is stored in a folder bearing the name of the project. This folder is located in:
```
/opt/allegro/data/fileserver/<company_id>/<project name>
```
The entire content of the project folders should be copied to the target server (see [Importing Fileserver Data](#importing-file-server-data)).
## Importing File Server Data
### Copying the Data
Place the exported projects' folder(s) content into the target file server's storage in the following folder:
```
/opt/allegro/data/fileserver/<company_id>/<project name>
```
### Fixing Registered URLs
Since URLs pointing to the file server contain the file server's address, these need to be changed to the address of the
new file server.
Note that this is not required if the new file server is replacing the old file server and can be accessed using the same
exact address.
Once the projects' data has been copied to the target server, and the projects themselves have been imported, see
[Changing ClearML Artifacts Links](change_artifact_links.md) for information on how to fix the URLs.
@@ -0,0 +1,543 @@
---
title: Multi-Tenant Service on Kubernetes
---
This guide provides step-by-step instructions for installing a ClearML multi-tenant service on a Kubernetes cluster.
It covers the installation and configuration steps necessary to set up ClearML in a cloud environment, including
enabling specific features and setting up necessary components.
## Prerequisites
* A Kubernetes cluster
* Credentials for the ClearML Enterprise Helm chart repository
* Credentials for the ClearML Enterprise DockerHub repository
* Credentials for the ClearML billing DockerHub repository
* URL for downloading the ClearML Enterprise applications configuration
* ClearML Billing server Helm chart
## Setting up ClearML Helm Repository
You need to add the ClearML Enterprise Helm repository to your local Helm setup. This repository contains the Helm
charts required for deploying the ClearML Server and its components.
Add the ClearML Enterprise repository using the following command. Replace `<TOKEN>` with the private token sent to
you by ClearML:
```
helm repo add allegroai-enterprise https://raw.githubusercontent.com/allegroai/clearml-enterprise-helm-charts/gh-pages --username <TOKEN> --password <TOKEN>
```
## Enabling Dynamic MIG GPUs
Allocating GPU fractions dynamically makes use of the NVIDIA GPU operator.
1. Add the NVIDIA Helm repository:
```
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```
2. Install the NVIDIA GPU operator with the following configuration:
```
helm install -n gpu-operator \
gpu-operator \
nvidia/gpu-operator \
--create-namespace \
--set migManager.enabled=false \
--set mig.strategy=mixed
```
## Install CDMO Chart
The ClearML Dynamic MIG Operator (CDMO) enables running AI workloads on Kubernetes with optimized hardware utilization
and workload performance by facilitating MIG GPU partitioning.
1. Prepare the `overrides.yaml` file with the following content. Replace `<allegroaienterprise_DockerHub_TOKEN>`
   with the private token provided by ClearML:
```
imageCredentials:
password: "<allegroaienterprise_DockerHub_TOKEN>"
```
2. Install the CDMO chart:
```
helm install -n cdmo-operator \
cdmo \
allegroai-enterprise/clearml-dynamic-mig-operator \
--create-namespace \
-f overrides.yaml
```
### Enable MIG support
1. Enable dynamic MIG support on your cluster by running the following command on **all nodes used for training** (run
for **each GPU** ID in your cluster):
```
nvidia-smi -i <gpu_id> -mig 1
```
This command can be issued from inside the `nvidia-device-plugin-daemonset` pod on the related node.
If the result of the previous command indicates that a node reboot is necessary, perform the reboot.
2. After enabling MIG support, label the MIG GPU nodes accordingly. This labeling helps in identifying nodes configured
with MIG support for resource management and scheduling:
```
kubectl label nodes <node-name> "cdmo.clear.ml/gpu-partitioning=mig"
```
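To confirm that MIG mode is active and the label was applied, a quick check might look like the following (a sketch; the `nvidia-smi` query fields assume a reasonably recent driver):

```
# Show the current MIG mode for every GPU on the node
nvidia-smi --query-gpu=index,mig.mode.current --format=csv

# List the nodes labeled for CDMO MIG partitioning
kubectl get nodes -l "cdmo.clear.ml/gpu-partitioning=mig"
```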
## Install ClearML Chart
Install the ClearML chart with the required configuration:
1. Prepare the `overrides.yaml` file and input the following content. Make sure to replace `<BASE_DOMAIN>` and `<SSO_*>`
with a valid domain that will have records pointing to the ingress controller accordingly.
The credentials specified in `<SUPERVISOR_USER_KEY>` and `<SUPERVISOR_USER_SECRET>` can be used to log in as the
supervisor user in the web UI.
   Note that the `<SUPERVISOR_USER_EMAIL>` value must be explicitly quoted. To do so, put `\"` around the quoted value.
   For example: `"\"email@example.com\""`
```
imageCredentials:
password: "<allegroaienterprise_DockerHub_TOKEN>"
clearml:
cookieDomain: "<BASE_DOMAIN>"
apiserver:
image:
tag: "3.21.6-1443"
ingress:
enabled: true
hostName: "api.<BASE_DOMAIN>"
service:
type: ClusterIP
extraEnvs:
- name: CLEARML__billing__enabled
value: "true"
- name: CLEARML__HOSTS__KAFKA__BILLING__HOST
value: "[clearml-billing-kafka.clearml-billing:9092]"
- name: CLEARML__HOSTS__REDIS__BILLING__HOST
value: clearml-billing-redis-master.clearml-billing
- name: CLEARML__HOSTS__REDIS__BILLING__DB
value: "2"
- name: CLEARML__SECURE__KAFKA__BILLING__security_protocol
value: SASL_PLAINTEXT
- name: CLEARML__SECURE__KAFKA__BILLING__sasl_mechanism
value: SCRAM-SHA-512
- name: CLEARML__SECURE__KAFKA__BILLING__sasl_plain_username
value: billing
- name: CLEARML__SECURE__KAFKA__BILLING__sasl_plain_password
value: "jdhfKmsd1"
- name: CLEARML__secure__login__sso__oauth_client__auth0__client_id
value: "<SSO_CLIENT_ID>"
- name: CLEARML__secure__login__sso__oauth_client__auth0__client_secret
value: "<SSO_CLIENT_SECRET>"
- name: CLEARML__services__login__sso__oauth_client__auth0__base_url
value: "<SSO_CLIENT_URL>"
- name: CLEARML__services__login__sso__oauth_client__auth0__authorize_url
value: "<SSO_CLIENT_AUTHORIZE_URL>"
- name: CLEARML__services__login__sso__oauth_client__auth0__access_token_url
value: "<SSO_CLIENT_ACCESS_TOKEN_URL>"
- name: CLEARML__services__login__sso__oauth_client__auth0__audience
value: "<SSO_CLIENT_AUDIENCE>"
- name: CLEARML__services__organization__features__user_management_advanced
value: "true"
- name: CLEARML__services__auth__ui_features_per_role__user__show_datasets
value: "false"
- name: CLEARML__services__auth__ui_features_per_role__user__show_orchestration
value: "false"
- name: CLEARML__services__applications__max_running_apps_per_company
value: "3"
- name: CLEARML__services__auth__default_groups__users__features
value: "[\"applications\"]"
- name: CLEARML__services__auth__default_groups__admins__features
value: "[\"config_vault\", \"experiments\", \"queues\", \"show_projects\", \"resource_dashboard\", \"user_management\", \"user_management_advanced\", \"app_management\", \"sso_management\", \"service_users\", \"resource_policy\"]"
- name: CLEARML__services__workers__resource_usages__supervisor_company
value: "d1bd92a3b039400cbafc60a7a5b1e52b" # Default company
- name: CLEARML__secure__credentials__supervisor__role
value: "system"
- name: CLEARML__secure__credentials__supervisor__allow_login
value: "true"
- name: CLEARML__secure__credentials__supervisor__user_key
value: "<SUPERVISOR_USER_KEY>"
- name: CLEARML__secure__credentials__supervisor__user_secret
value: "<SUPERVISOR_USER_SECRET>"
- name: CLEARML__secure__credentials__supervisor__sec_groups
value: "[\"users\", \"admins\", \"queue_admins\"]"
- name: CLEARML__secure__credentials__supervisor__email
value: "\"<SUPERVISOR_USER_EMAIL>\""
- name: CLEARML__apiserver__company__unique_names
value: "true"
fileserver:
ingress:
enabled: true
hostName: "file.<BASE_DOMAIN>"
service:
type: ClusterIP
webserver:
image:
tag: "3.21.3-1657"
ingress:
enabled: true
hostName: "app.<BASE_DOMAIN>"
service:
type: ClusterIP
clearmlApplications:
enabled: true
```
2. Install ClearML
```
helm install -n clearml \
clearml \
allegroai-enterprise/clearml-enterprise \
--create-namespace \
-f overrides.yaml
```
## Shared Redis installation
Set up a shared Redis instance that multiple components of your ClearML deployment can use:
1. If not already added, add the Bitnami repository:
```
helm repo add bitnami https://charts.bitnami.com/bitnami
```
2. Prepare the `overrides.yaml` with the following content:
```
auth:
password: "sdkWoq23"
```
3. Install Redis:
```
helm install -n redis-shared \
redis \
bitnami/redis \
--create-namespace \
--version=17.8.3 \
-f overrides.yaml
```
## Install Billing Chart
The billing chart is not available as part of the ClearML private Helm repo. `clearml-billing-1.1.0.tgz` is directly
provided by the ClearML team.
1. Prepare `overrides.yaml` - Create the file with the following content, replacing `<billing_DockerHub_TOKEN>`
   with the appropriate value:
```
imageCredentials:
username: dockerhubcustpocbillingaccess
password: "<billing_DockerHub_TOKEN>"
```
1. Install the billing chart:
```
helm install -n clearml-billing \
clearml-billing \
clearml-billing-1.1.0.tgz \
--create-namespace \
-f overrides.yaml
```
## Namespace Isolation using Network Policies
For enhanced security, isolate namespaces using the following NetworkPolicies:
```
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
namespace: clearml
spec:
podSelector: {}
policyTypes:
- Ingress
ingress:
- from:
- podSelector: {}
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-clearml-ingress
namespace: clearml
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: clearml-clearml-enterprise
policyTypes:
- Ingress
ingress:
- from:
- ipBlock:
cidr: 0.0.0.0/0
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-clearml-ingress
namespace: clearml-billing
spec:
podSelector: {}
policyTypes:
- Ingress
ingress:
- from:
- podSelector: {}
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: clearml
```
## Applications Installation
To install ClearML GUI applications, follow these steps:
1. Get the apps to install and the installation script by downloading and extracting the archive provided by ClearML:
```
wget -O apps.zip "<ClearML enterprise applications configuration download url>"
unzip apps.zip
```
2. Install the apps:
```
python upload_apps.py \
--host $APISERVER_ADDRESS \
--user $APISERVER_USER \
--password $APISERVER_PASSWORD \
--dir apps -ml
```
## Tenant Configuration
Create tenants and corresponding admin users, and set up an SSO domain whitelist for secure access. To configure tenants,
follow these steps (all requests must be authenticated by root or admin). Note that placeholders like `<PLACEHOLDER>`
must be substituted with valid domain names or values from responses.
1. Define the following variables:
```
APISERVER_URL="https://api.<BASE_DOMAIN>"
APISERVER_KEY="GGS9F4M6XB2DXJ5AFT9F"
APISERVER_SECRET="2oGujVFhPfaozhpuz2GzQfA5OyxmMsR3WVJpsCR5hrgHFs20PO"
```
2. Create a *Tenant* (company):
```
curl $APISERVER_URL/system.create_company \
-H "Content-Type: application/json" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"name":"<TENANT_NAME>"}'
```
This returns the new Company ID (`<COMPANY_ID>`). If needed, you can list all companies with the following command:
```
curl -u $APISERVER_KEY:$APISERVER_SECRET $APISERVER_URL/system.get_companies
```
3. Create an *Admin User*:
```
curl $APISERVER_URL/auth.create_user \
-H "Content-Type: application/json" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"name":"<ADMIN_USER_NAME>","company":"<COMPANY_ID>","email":"<ADMIN_USER_EMAIL>","role":"admin"}'
```
This returns the new User ID (`<USER_ID>`).
4. Generate *Credentials* for the new Admin User:
```
curl $APISERVER_URL/auth.create_credentials \
-H "Content-Type: application/json" \
-H "X-Clearml-Impersonate-As: <USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET
```
This returns a set of key and secret credentials associated with the new Admin User.
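Before handing the credentials over, you can sanity-check them by calling an authenticated endpoint. A sketch, assuming `<NEW_KEY>`/`<NEW_SECRET>` are the values returned above and that the `users.get_current_user` endpoint is available:

```
curl -u <NEW_KEY>:<NEW_SECRET> $APISERVER_URL/users.get_current_user
```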
5. Create an SSO Domain *Whitelist*. The `<USERS_EMAIL_DOMAIN>` is the email domain setup for users to access through SSO.
```
curl $APISERVER_URL/login.set_domains \
-H "Content-Type: application/json" \
-H "X-Clearml-Act-As: <USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"domains":["<USERS_EMAIL_DOMAIN>"]}'
```
### Install ClearML Agent Chart
To install the ClearML Agent Chart, follow these steps:
1. Prepare the `overrides.yaml` file with the following content. Make sure to replace placeholders like
`<allegroaienterprise_DockerHub_TOKEN>`, `<BASE_DOMAIN>`, and `<TENANT_NAMESPACE>` with the appropriate values:
```
imageCredentials:
password: "<allegroaienterprise_DockerHub_TOKEN>"
clearml:
agentk8sglueKey: "-" # TODO --> Generate credentials from API in the new tenant
agentk8sglueSecret: "-" # TODO --> Generate credentials from API in the new tenant
agentk8sglue:
extraEnvs:
- name: CLEARML_K8S_SUPPORT_SUSPENSION
value: "1"
- name: CLEARML_K8S_PORTS_MODE_ON_REQUEST_ONLY
value: "1"
- name: CLEARML_AGENT_REDIS_HOST
value: "redis-master.redis-shared"
- name: CLEARML_AGENT_REDIS_PORT
value: "6379"
- name: CLEARML_AGENT_REDIS_DB
value: "0"
- name: CLEARML_AGENT_REDIS_PASSWORD
value: "sdkWoq23"
image:
tag: 1.24-1.8.1rc99-159
monitoredResources:
maxResources: 3
minResourcesFieldName: "metadata|labels|required-resources"
maxResourcesFieldName: "metadata|labels|required-resources"
apiServerUrlReference: "https://api.<BASE_DOMAIN>"
fileServerUrlReference: "https://file.<BASE_DOMAIN>"
webServerUrlReference: "https://app.<BASE_DOMAIN>"
defaultContainerImage: "python:3.9"
debugMode: true
createQueues: true
queues:
default:
templateOverrides:
labels:
required-resources: "0.5"
billing-monitored: "true"
queueSettings:
maxPods: 10
gpu-fraction-1_00:
templateOverrides:
labels:
required-resources: "1"
billing-monitored: "true"
resources:
limits:
nvidia.com/mig-7g.40gb: 1
clear.ml/fraction-1: "1"
queueSettings:
maxPods: 10
gpu-fraction-0_50:
templateOverrides:
labels:
required-resources: "0.5"
billing-monitored: "true"
resources:
limits:
nvidia.com/mig-3g.20gb: 1
clear.ml/fraction-1: "0.5"
queueSettings:
maxPods: 10
gpu-fraction-0_25:
templateOverrides:
labels:
required-resources: "0.25"
billing-monitored: "true"
resources:
limits:
nvidia.com/mig-2g.10gb: 1
clear.ml/fraction-1: "0.25"
queueSettings:
maxPods: 10
sessions:
portModeEnabled: false # set to true when using TCP ports mode
agentID: "<TENANT_NAMESPACE>"
externalIP: 0.0.0.0 # IP of one of the workers
startingPort: 31010 # be careful to not overlap other tenants (startingPort + maxServices)
maxServices: 10
```
2. Install the ClearML Agent Chart in the specified tenant namespace:
```
helm install -n <TENANT_NAMESPACE> \
clearml-agent \
allegroai-enterprise/clearml-enterprise-agent \
--create-namespace \
-f overrides.yaml
```
3. Create a queue via the API:
```
curl $APISERVER_URL/queues.create \
-H "Content-Type: application/json" \
-H "X-Clearml-Impersonate-As: <USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"name":"default"}'
```
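To verify the queue was created in the tenant, you can query it back; a sketch assuming the `queues.get_all` endpoint:

```
curl $APISERVER_URL/queues.get_all \
-H "Content-Type: application/json" \
-H "X-Clearml-Impersonate-As: <USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"name":"default"}'
```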
### Tenant Namespace isolation with NetworkPolicies
To ensure network isolation for each tenant, create a `NetworkPolicy` in the tenant namespace. This way,
the entire namespace/tenant will not accept connections from other namespaces.
Create a `NetworkPolicy` in the tenant namespace with the following configuration:
```
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
spec:
podSelector: {}
policyTypes:
- Ingress
ingress:
- from:
- podSelector: {}
```
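After saving the policy to a file, apply and verify it per tenant namespace; for example:

```
kubectl -n <TENANT_NAMESPACE> apply -f default-deny-ingress.yaml
kubectl -n <TENANT_NAMESPACE> get networkpolicy
```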
### Install Task Traffic Router Chart
Install the [Task Traffic Router](appgw.md) in your Kubernetes cluster, allowing it to manage and route tasks:
1. Prepare the `overrides.yaml` file with the following content:
```
imageCredentials:
password: "<allegroaienterprise_DockerHub_TOKEN>"
clearml:
apiServerUrlReference: "http://clearml-enterprise-apiserver.clearml:8008"
apiserverKey: "<TENANT_KEY>"
apiserverSecret: "<TENANT_SECRET>"
jwksKey: "ymLh1ok5k5xNUQfS944Xdx9xjf0wueokqKM2dMZfHuH9ayItG2"
ingress:
enabled: true
hostName: "<unique url in same domain as apiserver/webserver>"
```
2. Install Task Traffic Router in the specified tenant namespace:
```
helm install -n <TENANT_NAMESPACE> \
clearml-ttr \
allegroai-enterprise/clearml-task-traffic-router \
--create-namespace \
-f overrides.yaml
```
@@ -0,0 +1,350 @@
---
title: On-Premises on Ubuntu
---
This guide provides step-by-step instructions for installing the ClearML Enterprise Server on a single Linux Ubuntu server.
## Prerequisites
The following are required for the ClearML on-premises server:
- At least 8 CPUs
- At least 32 GB RAM
- OS - Ubuntu 20 or higher
- 4 Disks
- Root
- For storing the operating system and Docker installation
- Recommended at least 30 GB
- mounted to `/`
- Docker
- For storing Docker data
- Recommended at least 80GB
- mounted to `/var/lib/docker` with permissions 710
- Data
- For storing Elastic and Mongo databases
- Size depends on the usage. Recommended not to start with less than 100 GB
- Mounted to `/opt/allegro/data`
- File Server
- For storing `fileserver` files (models and debug samples)
- Size depends on usage
- Mounted to `/opt/allegro/data/fileserver`
- User for running ClearML services with administrator privileges
- Ports 8080, 8081, and 8008 available for the ClearML Server services
In addition, make sure you have the following (provided by ClearML):
- Docker hub credentials to access the ClearML images
- `docker-compose.yml` - The main compose file containing the services definitions
- `docker-compose.override.yml` - The override file containing additions that are server specific, such as SSO integration
- `constants.env` - The `env` file contains values of items in the `docker-compose` that are unique for
a specific environment, such as keys and secrets for system users, credentials, and image versions. The constant file
should be reviewed and modified prior to the server installation
## Installing ClearML Server
### Preliminary Steps
1. Install Docker CE by following the [official installation instructions](https://docs.docker.com/install/linux/docker-ce/ubuntu/).
1. Verify the Docker CE installation:
```
docker run hello-world
```
Expected output:
```
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub. (amd64)
3. The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it to your terminal.
```
1. Install `docker-compose`:
```
sudo curl -L "https://github.com/docker/compose/releases/download/1.24.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```
:::note
You might need to downgrade `urllib3` by running `sudo pip3 install urllib3==1.26.2`
:::
1. Increase `vm.max_map_count` for Elasticsearch in Docker:
```
echo "vm.max_map_count=262144" > /tmp/99-allegro.conf
echo "vm.overcommit_memory=1" >> /tmp/99-allegro.conf
echo "fs.inotify.max_user_instances=256" >> /tmp/99-allegro.conf
sudo mv /tmp/99-allegro.conf /etc/sysctl.d/99-allegro.conf
sudo sysctl -w vm.max_map_count=262144
sudo service docker restart
```
1. Disable THP. Create the `/etc/systemd/system/disable-thp.service` service file with the following content:
:::important
The `ExecStart` string (under `[Service]`) should be a single line.
:::
```
[Unit]
Description=Disable Transparent Huge Pages (THP)
[Service]
Type=simple
ExecStart=/bin/sh -c "echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled && echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag"
[Install]
WantedBy=multi-user.target
```
1. Enable the service so that it starts at boot:
```
sudo systemctl daemon-reload
sudo systemctl enable disable-thp
```
1. Restart the machine
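After the reboot, you can verify that THP is disabled and the service is active; the value shown in brackets should be `never`:

```
cat /sys/kernel/mm/transparent_hugepage/enabled
systemctl status disable-thp --no-pager
```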
### Installing the Server
1. Remove any previous installation of ClearML Server
```
sudo rm -R /opt/clearml/
sudo rm -R /opt/allegro/
```
1. Create local directories for the databases and storage:
```
sudo mkdir -pv /opt/allegro/data/elastic7plus
sudo chown 1000:1000 /opt/allegro/data/elastic7plus
sudo mkdir -pv /opt/allegro/data/mongo_4/configdb
sudo mkdir -pv /opt/allegro/data/mongo_4/db
sudo mkdir -pv /opt/allegro/data/redis
sudo mkdir -pv /opt/allegro/logs/apiserver
sudo mkdir -pv /opt/allegro/documentation
sudo mkdir -pv /opt/allegro/data/fileserver
sudo mkdir -pv /opt/allegro/logs/fileserver
sudo mkdir -pv /opt/allegro/logs/fileserver-proxy
sudo mkdir -pv /opt/allegro/data/fluentd/buffer
sudo mkdir -pv /opt/allegro/config/webserver_external_files
sudo mkdir -pv /opt/allegro/config/onprem_poc
```
1. Copy the following ClearML configuration files to `/opt/allegro`
* `constants.env`
* `docker-compose.override.yml`
* `docker-compose.yml`
1. Create an initial ClearML configuration file `/opt/allegro/config/onprem_poc/apiserver.conf` with a fixed user:
```
auth {
  fixed_users {
    enabled: true,
    users: [
      {username: "support", password: "<enter password here>", admin: true, name: "allegro.ai Support User"},
    ]
  }
}
```
1. Log into the Docker Hub repository using the username and password provided by ClearML:
```
sudo docker login -u=$DOCKERHUB_USER -p=$DOCKERHUB_PASSWORD
```
1. Start the services by changing to the directory containing the docker-compose files and running the following command:
```
sudo docker-compose --env-file constants.env up -d
```
1. Verify web access by browsing to your URL (IP address) and port 8080.
```
http://<server_ip_here>:8080
```
## Security
To ensure the server's security, it's crucial to open only the necessary ports.
### Working with HTTP
Directly accessing the server using `HTTP` is not recommended. However, if you choose to do so, only the following ports
should be open to any location where a ClearML client (`clearml-agent`, SDK, or web browser) may operate:
* Port 8080 for accessing the WebApp
* Port 8008 for accessing the API server
* Port 8081 for accessing the file server
### Working with TLS / HTTPS
TLS termination through an external mechanism, such as a load balancer, is supported and recommended. For such a setup,
the following subdomains should be forwarded to the corresponding ports on the server:
* `https://api.<domain>` should be forwarded to port 8008
* `https://app.<domain>` should be forwarded to port 8080
* `https://files.<domain>` should be forwarded to port 8081
**Critical: Ensure no other ports are open to maintain the highest level of security.**
Additionally, ensure that the following URLs are correctly configured in the server's environment file:
```
WEBSERVER_URL_FOR_EXTERNAL_WORKERS=https://app.<your-domain>
APISERVER_URL_FOR_EXTERNAL_WORKERS=https://api.<your-domain>
FILESERVER_URL_FOR_EXTERNAL_WORKERS=https://files.<your-domain>
```
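The three URLs above follow a fixed subdomain scheme. As a sketch, a small hypothetical helper (not part of ClearML) can render the environment-file lines for a given domain:

```python
# Hypothetical helper that renders the three *_URL_FOR_EXTERNAL_WORKERS
# lines above for a given domain (the subdomain scheme follows the example).
SUBDOMAINS = {
    "WEBSERVER_URL_FOR_EXTERNAL_WORKERS": "app",
    "APISERVER_URL_FOR_EXTERNAL_WORKERS": "api",
    "FILESERVER_URL_FOR_EXTERNAL_WORKERS": "files",
}

def render_env_lines(domain: str) -> list:
    """Return the environment-file lines pointing each service at its subdomain."""
    return [f"{var}=https://{sub}.{domain}" for var, sub in SUBDOMAINS.items()]

print("\n".join(render_env_lines("example.com")))
```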
:::note
If you prefer to use URLs that do not begin with `app`, `api`, or `files`, you must also add the following configuration
for the web server in your `docker-compose.override.yml` file:
```
webserver:
  environment:
    - WEBSERVER__displayedServerUrls={"apiServer":"$APISERVER_URL_FOR_EXTERNAL_WORKERS","filesServer":"$FILESERVER_URL_FOR_EXTERNAL_WORKERS"}
```
:::
## Backups
The main components that contain data are the databases:
* MongoDB
* ElasticSearch
* File server
It is recommended to back them up periodically.
### Fileserver
It is recommended to back up the entire file server volume.
* Perform at least a daily backup.
* Retain backups for at least 2 days.
### ElasticSearch
Please refer to [ElasticSearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html) for creating snapshots.
### MongoDB
Please refer to the [MongoDB documentation](https://www.mongodb.com/docs/manual/core/backups/) for backing up and restoring.
## Monitoring
The following monitoring is recommended:
### Basic Hardware Monitoring
#### CPU
CPU usage varies depending on system usage. We recommend monitoring CPU usage and alerting when it is higher than
normal. Recommended starting alerts are 5-minute CPU load levels of 5 and 10, adjusted according to observed performance.
#### RAM
Available memory also varies depending on system usage. Due to spikes in usage when performing certain tasks, 6-8 GB
of available RAM is recommended as the standard baseline; some use cases may require more. We therefore recommend having
8 GB of available memory on top of regular system usage. Alerts should trigger when available memory drops below normal.
#### Disk Usage
There are several disks used by the system. We recommend monitoring all of them. Standard alert levels are 20%, 10% and
5% of free disk space.
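The alert levels above can be sketched as a small monitoring helper. This is an illustrative example, not part of ClearML; the severity names are assumptions:

```python
import shutil
from typing import Optional

# Sketch of the alert levels described above (20% / 10% / 5% free disk space).
# Severity names are illustrative assumptions.
THRESHOLDS = [(5.0, "critical"), (10.0, "major"), (20.0, "warning")]

def severity_for(free_pct: float) -> Optional[str]:
    """Map a free-space percentage to an alert severity, or None if healthy."""
    for limit, severity in THRESHOLDS:
        if free_pct < limit:
            return severity
    return None

def disk_alert(path: str) -> Optional[str]:
    """Check one mounted filesystem against the thresholds."""
    usage = shutil.disk_usage(path)
    return severity_for(usage.free / usage.total * 100)
```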
### Service Availability
The following services should be monitored periodically for availability and for response time:
* `apiserver` - [http://localhost:10000/api/debug.ping](http://localhost:10000/api/debug.ping) should return HTTP 200
* `webserver` - [http://localhost:10000](http://localhost:10000/) should return HTTP 200
* `fileserver` - [http://localhost:10000/files/](http://localhost:10000/files/) should return HTTP 405 ("method not allowed")
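The checks above can be sketched as a monitoring loop; note that for the fileserver, HTTP 405 is the healthy response. The probe function is an illustrative assumption, not a ClearML tool:

```python
import urllib.request
import urllib.error

# Each monitored endpoint maps to the HTTP status that indicates health
# (paths and expected codes are taken from the list above).
EXPECTED = {
    "/api/debug.ping": 200,  # apiserver
    "/": 200,                # webserver
    "/files/": 405,          # fileserver: "method not allowed" means it is up
}

def is_healthy(path: str, status: int) -> bool:
    """True when the observed status matches the expected one for the path."""
    return EXPECTED.get(path) == status

def probe(base: str = "http://localhost:10000"):
    """Yield (path, healthy) pairs for every monitored endpoint."""
    for path in EXPECTED:
        try:
            status = urllib.request.urlopen(base + path, timeout=10).status
        except urllib.error.HTTPError as err:
            status = err.code  # e.g. the expected 405 from the fileserver
        except OSError:
            status = -1        # service unreachable
        yield path, is_healthy(path, status)
```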
### API Server Docker Memory Usage
A usage spike can happen during normal operation. But very high spikes (above 6GB) are not expected. We recommend using
`docker stats` to get this information.
For example, the following command retrieves the API server's information from the Docker server:
```
sudo curl -s --unix-socket /var/run/docker.sock http://localhost/containers/allegro-apiserver/stats?stream=false
```
We recommend monitoring the API server memory in addition to the system's available RAM. Alerts should be triggered
when memory usage of the API server exceeds the normal behavior. A starting value can be 6 GB.
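As a sketch, the memory figure can be extracted from the JSON returned by the stats endpoint above and compared to the suggested 6 GB starting value. The `memory_stats.usage` field name follows the Docker Engine API; verify it against your Docker version:

```python
import json

# Extract memory usage from the Docker Engine API response of
# /containers/<name>/stats?stream=false and compare it to the alert value.
ALERT_BYTES = 6 * 1024 ** 3  # 6 GB starting alert value suggested above

def memory_alert(stats_json: str) -> bool:
    """True if the container's memory usage exceeds the alert threshold."""
    usage = json.loads(stats_json)["memory_stats"]["usage"]
    return usage > ALERT_BYTES
```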
### Backup Failures
It is also highly recommended to monitor the backups and to alert if a backup has failed.
## Troubleshooting
In normal operation mode, all services should be up, and a call to `sudo docker ps` should yield the list of services.
If a service fails, it is usually due to one of the following:
* Lack of required resources such as storage or memory
* Incorrect configuration
* Software anomaly
When a service fails, it should automatically restart. However, if the cause of the failure is persistent, the service
will fail again. If a service fails, do the following:
### Check the Log
Run:
```
sudo docker logs -n 1000 <container name or ID>
```
See if there is an error message in the log that can explain the failure.
### Check the Server's Environment
The system should be constantly monitored. In particular, check the following:
* **Storage space**: run `df -h`
* **RAM**:
  * Run `vmstat -s` to check available RAM
  * Run `top` to check the running processes.
:::note
Some operations, such as complex queries, may cause a spike in memory usage. Therefore, it is recommended to have at least 8GB of free RAM available.
:::
* **Network**: Make sure that there is external access to the services
* **CPU**: The best indicator of the need for additional compute resources is high CPU usage of the `apiserver` and `apiserver-es` services.
  * Examine the usage of each service using `sudo docker stats`
  * If there is a need to add CPUs, then after updating the server, increase the number of workers on the `apiserver`
    service by changing the value of `APISERVER_WORKERS_NUMBER` in the `constants.env` file (up to one additional worker per additional core).
### API Server
If the `allegro-apiserver` container fails, or the web application shows unexpected errors and the browser's developer
tools (F12) Network tab shows error codes returned by the server, also check the `apiserver` log, which is written to
`/opt/allegro/logs/apiserver/apiserver.log`.
Additionally, you can check the server availability using:
```
curl http://localhost:8008/api/debug.ping
```
This should return HTTP 200.
### Web Server
Check the webserver availability by running the following:
```
curl http://<server's IP address>:8080/configuration.json
```
---
title: Group Integration in Active Directory SAML
---
Follow this guide to integrate groups from Active Directory with ClearML.
## Actions in Active Directory
Make sure that the group claims are passed to the ClearML app.
## Actions in ClearML
### Creating the Groups
* Groups integration is disabled by default
* Groups are not auto-created and need to be created manually in ClearML using the [Users & Groups](../../webapp/settings/webapp_settings_users.md#user-groups)
page in the ClearML web UI, or using the ClearML API.
* If a group does not exist in ClearML, the user will be created, but will not be assigned to any group.
* Group claim used by ClearML is `groups` by default
* Group name is taken from the first CN of the full DN path. For example, for the following DN: `CN=test, OU=unit, DU=mycomp`,
the group name in ClearML will be `test`
* The group name matching in ClearML is case-sensitive by default
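The first-CN rule above can be illustrated with a minimal sketch, assuming simple comma-separated RDNs (escaped commas in values are not handled):

```python
from typing import Optional

# The ClearML group name is the first CN in the full DN path.
def group_name_from_dn(dn: str) -> Optional[str]:
    for rdn in dn.split(","):
        key, _, value = rdn.strip().partition("=")
        if key.strip().upper() == "CN":
            return value.strip()
    return None

print(group_name_from_dn("CN=test, OU=unit, DU=mycomp"))  # test
```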
### Configuring ClearML Group Integration
To enable ClearML group integration, set the following environment variable:
```
CLEARML__services__login__sso__saml_client__microsoft_ad__groups__enabled=true
```
To configure groups that should automatically become admins in ClearML, set the following environment variable:
```
CLEARML__services__login__sso__saml_client__microsoft_ad__groups__admins=[<admin_group_name1>, <admin_group_name2>, ...]
```
To change the default group claim, set the following environment variable:
```
CLEARML__services__login__sso__saml_client__microsoft_ad__groups__claim=...
```
To make group matching case-insensitive, set the following environment variable:
```
CLEARML__services__login__sso__saml_client__microsoft_ad__groups__case_sensitive=false
```
To prohibit users who do not belong to any of the AD groups created in ClearML from signing up, set the following environment variable:
```
CLEARML__services__login__sso__saml_client__microsoft_ad__groups__prohibit_user_signup_if_not_in_group=true
```
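The variable names above follow a convention: a `CLEARML` prefix plus the configuration path, joined with double underscores. A hypothetical helper illustrating the convention:

```python
# Build a ClearML-style environment variable name from a configuration path.
# This helper is illustrative only; it is not part of the ClearML SDK.
def env_var_name(*path: str, prefix: str = "CLEARML") -> str:
    return "__".join((prefix,) + path)

name = env_var_name("services", "login", "sso", "saml_client",
                    "microsoft_ad", "groups", "enabled")
print(name)
```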
@@ -0,0 +1,225 @@
---
title: KeyCloak IdP Configuration
---
This procedure is a step-by-step guide for configuring the ClearML Enterprise Server with the KeyCloak IdP.
In the following sections, "publicly accessible" does not have to mean open to the entire world or accessible from the
Internet; it simply means accessible to your users from their workstations (typically when using a browser).
In the following sections, you will be instructed to set up different environment variables for the ClearML Server. If
using a `docker-compose` deployment, these should be defined in your `docker-compose.override.yaml` file, under the
`apiserver` service environment variables, as follows:
```
services:
  ...
  apiserver:
    ...
    environment:
      - <name>=<value>
      ...
```
If using a Kubernetes deployment, these should be set in the ClearML Enterprise server chart values override file, under
the `.Values.apiserver.extraEnvs` array section, as follows:
```
...
apiserver:
  extraEnvs:
    - name: <name>
      value: "<value>"
    - ...
```
All examples below are provided in the Kubernetes format.
## Prerequisites
* An existing ClearML Enterprise server / control-plane installation (using `docker-compose` or Kubernetes), which is
set up with a publicly accessible endpoint for the ClearML WebApp
* A KeyCloak IdP installed with a publicly accessible endpoint, with you as admin having access to the KeyCloak administration UI.
## Configuration
### Basic Setup
#### KeyCloak Configuration
In the KeyCloak administration UI:
1. Register a new ClearML app with the callback URL: `<ClearML_webapp_address>/callback_keycloak`
2. Make sure that the claims representing `user_id`, `email` and `user name` are returned
3. Make a note of the `client_id`, `client_secret`, `Auth url` and `Access token url` for configuration in the ClearML Server.
#### ClearML Server Configuration
In the ClearML Server deployment, set the environment variables specified below.
##### KeyCloak Base URL
Use the start of the token or authorization endpoint, usually the part just before `protocol/openid-connect/...`
```
- name: CLEARML__services__login__sso__oauth_client__keycloak__base_url
  value: "<base url>"
```
##### KeyCloak Authorization Endpoint
```
- name: CLEARML__services__login__sso__oauth_client__keycloak__authorize_url
  value: "<authorization endpoint>"
```
##### KeyCloak Access Token Endpoint
```
- name: CLEARML__services__login__sso__oauth_client__keycloak__access_token_url
  value: "<token endpoint>"
```
##### KeyCloak Client ID
The client ID is obtained when creating the KeyCloak ClearML App.
```
- name: CLEARML__secure__login__sso__oauth_client__keycloak__client_id
  value: "<client_id>"
```
##### KeyCloak Client Secret
The client secret is obtained when creating the KeyCloak ClearML App.
```
- name: CLEARML__secure__login__sso__oauth_client__keycloak__client_secret
  value: "<client_secret>"
```
##### Automatic User Creation Support
Usually, when using IdPs in ClearML, the ClearML Server will map users signing in to the server into tenants (companies)
using predefined whitelists and specific invitations (users explicitly added by admins).
To support automatic user creation in a trusted environment, where all users signing in using this IdP are automatically
added to the same tenant (company), the following environment variable should be set:
```
- name: CLEARML__secure__login__sso__oauth_client__keycloak__default_company
  value: "<company_id>"
```
### User Groups Integration
This option allows automatically synchronizing group membership from KeyCloak into existing ClearML User Groups when
logging in users (this is done on every user login, not just on user sign-in).
Make sure a ClearML User Group exists for each potential KeyCloak group that should be synchronized to
prevent an uncontrolled proliferation of user groups. The ClearML server will not automatically create user groups in
this mode.
#### Keycloak Configuration
* When configuring the Open ID client for ClearML:
  * Navigate to the `Client Scopes` tab
  * Click on the first row `<clearml client>-dedicated`
  * Click `Add Mapper → By configuration` and then select the `Group membership` option
  * In the opened dialog, enter the name `groups` and token claim name `groups`
  * Uncheck the `Full group path` option and save the mapper
* To validate yourself:
  * Return to the `Client Details → Client scopes` tab
  * Go to the `Evaluate` sub-tab and select a user that has any group memberships
  * On the right side, navigate to `Generated ID` token and then to `Generated User Info`
  * Verify that in both cases the `groups` claim appears in the displayed user data
#### ClearML Server Configuration
Set the following environment variables for the `apiserver` service:
```
- name: CLEARML__services__login__sso__oauth_client__keycloak__groups__enabled
  value: "true"
- name: CLEARML__services__login__sso__oauth_client__keycloak__groups__claim
  value: "groups"
- name: CLEARML__services__login__sso__oauth_client__keycloak__claims__name
  value: "preferred_username"
```
##### Setting Administrators by Group Association
If you would like the members of a particular KeyCloak group to be set as administrators in ClearML, set the
following environment variable. Note that in this case, the KeyCloak group(s) do not have to be present in the ClearML Server.
```
- name: CLEARML__services__login__sso__oauth_client__keycloak__groups__admins
  value: "[\"<the name of admin group from Keycloak>\"]"
```
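The value above is a JSON list embedded in a YAML string, which is why the quotes are escaped. `json.dumps` can generate both layers (`ml-admins` is a hypothetical group name used for illustration):

```python
import json

# Build the inner JSON list, then escape it for embedding in a YAML "value:" field.
inner = json.dumps(["ml-admins"])   # the JSON list itself
yaml_value = json.dumps(inner)      # the list escaped as a quoted string
print(inner)
print(yaml_value)
```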
##### Restrict User Signup
To prevent sign-in for users who have no matching group found using the above-mentioned configuration, set the
following environment variable:
```
- name: CLEARML__services__login__sso__oauth_client__keycloak__groups__prohibit_user_signup_if_not_in_group
  value: "true"
```
### Administrator User Role Association
For integration of an admin user role from KeyCloak into the ClearML service, do the following.
#### KeyCloak Configuration
1. For each administrator user, assign the admin role to that user in KeyCloak
2. In the `Client Scopes` tab, make sure that the `roles` claim is returned in the access token or userinfo token
(this depends on the configuration in step 1)
#### ClearML Server Configuration
By default, the ClearML Server uses the admin claim to identify administrator users. To use a different claim name
for designating the admin role, set the following environment variable:
```
- name: CLEARML__services__login__sso__oauth_client__keycloak__admin_role
  value: "<admin claim name>"
```
#### Disabling Admin Role Association
To disable the automatic administrator claim and manage administrators solely from inside the ClearML WebApp, make sure
that user roles are not returned by KeyCloak in the auth token or the `userinfo` endpoint, and/or set the following
ClearML Server environment variable:
```
- name: CLEARML__services__login__sso__oauth_client__keycloak__admin_role
  value: ""
```
### Additional ClearML Server Configurations
#### KeyCloak Session Logout
To automatically log out the user from the KeyCloak provider when the user logs out of the ClearML service, set the
following environment variable. This will make sure that the KeyCloak session is not maintained in the browser so that
when the user tries to log into the ClearML service, the KeyCloak login page will be used again and not skipped.
```
- name: CLEARML__services__login__sso__oauth_client__keycloak__idp_logout
  value: "true"
```
#### User Info Source
By default, the user info is taken from the KeyCloak access token. If you prefer to use the user info available through
the OAuth protocol's `userinfo` endpoint, set the following environment variable:
```
- name: CLEARML__services__login__sso__oauth_client__keycloak__get_info_from_access_token
  value: "false"
```
@@ -0,0 +1,98 @@
---
title: Multi-Tenant Login Mode
---
In a multi-tenant setup, each external tenant can be represented by an SSO client defined in the customer's identity
provider (KeyCloak). Each ClearML tenant can be associated with a particular external tenant. Currently, only one
ClearML tenant can be associated with each external tenant.
## Setup IdP/SSO Client in Identity Provider
1. Add the following URL to "Valid redirect URIs": `<clearml_webapp_address>/callback_<client_id>`
2. Add the following URLs to "Valid post logout redirect URIs":
```
<clearml_webapp_address>/login
<clearml_webapp_address>/login/<external tenant ID>
```
3. Make sure the external tenant ID and groups are returned as claims for each user
## Configure ClearML to use Multi-Tenant Mode
Set the following environment variables in the ClearML enterprise helm chart under the `apiserver` section:
* To turn on the multi-tenant login mode:
```
- name: CLEARML__services__login__sso__tenant_login
  value: "true"
```
* To hide any global IdP/SSO configuration that's not associated with a specific ClearML tenant:
```
- name: CLEARML__services__login__sso__allow_settings_providers
  value: "false"
```
Enable `onlyPasswordLogin` by setting the following environment variable in the helm chart under the `webserver` section:
```
- name: WEBSERVER__onlyPasswordLogin
  value: "true"
```
## Setup IdP for a ClearML Tenant
To set an IdP client for a ClearML tenant, you'll need to set the ClearML tenant settings and define an identity provider:
1. Call the following API to set the ClearML tenant settings:
```
curl $APISERVER_URL/system.update_company_sso_config -H "Content-Type: application/json" -u $APISERVER_KEY:$APISERVER_SECRET -d'{
  "company": "<company_id>",
  "sso": {
    "tenant": "<external tenant ID>",
    "group_mapping": {
      "IDP group name1": "Clearml group name1",
      "IDP group name2": "Clearml group name2"
    },
    "admin_groups": ["IDP admin group name1", "IDP admin group name2"]
  }}'
```
2. Call the following API to define the ClearML tenant identity provider:
```
curl $APISERVER_URL/sso.save_provider_configuration -H "Content-Type: application/json" -u $APISERVER_KEY:$APISERVER_SECRET -d'{
  "provider": "keycloak",
  "company": "<company_id>",
  "configuration": {
    "id": "<some unique id here, you can use company_id>",
    "display_name": "<The text that you want to see on the login button>",
    "client_id": "<client_id from IDP>",
    "client_secret": "<client secret from IDP>",
    "authorization_endpoint": "<authorization_endpoint from IDP OpenID configuration>",
    "token_endpoint": "<token_endpoint from IDP OpenID configuration>",
    "revocation_endpoint": "<revocation_endpoint from IDP OpenID configuration>",
    "end_session_endpoint": "<end_session_endpoint from IDP OpenID configuration>",
    "logout_from_provider": true,
    "claim_tenant": "tenant_key",
    "claim_name": "name",
    "group_enabled": true,
    "claim_groups": "ad_groups_trusted",
    "group_prohibit_user_login_if_not_in_group": true
  }}'
```
The above configuration assumes the following:
* On logout from ClearML, the user is also logged out from the Identity Provider
* External tenant ID for the user is returned under the `tenant_key` claim
* User display name is returned under the `name` claim
* User groups list is returned under the `ad_groups_trusted` claim
* Group integration is turned on, and a user will be allowed to log in if any of the groups they belong to in the
IdP exists under the corresponding ClearML tenant (after group name translation is done according to the ClearML tenant settings)
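The step-1 payload can also be built programmatically before posting it with curl or an HTTP client. A minimal sketch; all IDs and group names are placeholders:

```python
import json

# Build the JSON body for system.update_company_sso_config (step 1 above).
def tenant_settings_payload(company_id, external_tenant, group_mapping, admin_groups):
    return json.dumps({
        "company": company_id,
        "sso": {
            "tenant": external_tenant,
            "group_mapping": group_mapping,
            "admin_groups": admin_groups,
        },
    })

payload = tenant_settings_payload(
    "company-123",
    "tenant-abc",
    {"IDP group name1": "ClearML group name1"},
    ["IDP admin group name1"],
)
```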
## Webapp Login
When running in multi-tenant login mode, a user belonging to some external tenant should use the following link to log in:
```
<clearml_webapp_address>/login/<external tenant ID>
```
@@ -0,0 +1,60 @@
---
title: Microsoft AD SAML
---
This document describes the configuration required for connecting a ClearML Kubernetes server to allow authenticating
users with Microsoft AD using SAML.
Configuration requires two steps:
* Configuration of the application in the active directory
* Configuration on the ClearML server side
## Active Directory Configuration
1. Register the ClearML app with the callback URL: `<clearml_webapp_address>/callback_microsoft_ad`
1. Make sure that SSO binding is set to HTTP-Redirect
1. Make sure that the following user claims are returned to the ClearML app:
```
emailaddress - user.mail
displayname - user.displayname
Unique user identifier - user.principalname
```
1. Generate the IdP metadata file and save the file and entity ID
## ClearML Server Side Configuration
The following should be configured in the override file:
```
apiserver:
  additionalConfigs:
    metadata.xml: |
      <?xml version="1.0"?>
      <test>
        <rule id="tst">
          <test_name>test</test_name>
        </rule>
      </test>
  extraEnvs:
    - name: "ALLEGRO__secure__login__sso__saml_client__microsoft_ad__entity_id"
      value: "<app_entity_id>"
    - name: "ALLEGRO__secure__login__sso__saml_client__microsoft_ad__idp_metadata_file"
      value: "/opt/clearml/config/default/metadata.xml"
    - name: "ALLEGRO__secure__login__sso__saml_client__microsoft_ad__default_company"
      value: "<company_id>"
    - name: "CLEARML__services__login__sso__saml_client__microsoft_ad__claims__object_id"
      value: "http://schemas.microsoft.com/identity/claims/objectidentifier"
    - name: "CLEARML__services__login__sso__saml_client__microsoft_ad__claims__name"
      value: "http://schemas.microsoft.com/identity/claims/displayname"
    - name: "CLEARML__services__login__sso__saml_client__microsoft_ad__claims__email"
      value: "emailAddress"
    - name: "CLEARML__services__login__sso__saml_client__microsoft_ad__claims__given_name"
      value: "givenName"
    - name: "CLEARML__services__login__sso__saml_client__microsoft_ad__claims__surname"
      value: "surname"
```
@@ -0,0 +1,292 @@
---
title: AWS VPC
---
This guide provides step-by-step instructions for installing the ClearML Enterprise Server on AWS using a Virtual Private Cloud (VPC).
It covers the following:
* Set up security groups and IAM role
* Create EC2 instance with required disks
* Install dependencies and mount disks
* Deploy ClearML version using docker-compose
* Set up load balancer and DNS
* Set up server backup
## Prerequisites
An AWS account with at least 2 availability zones is required. It is recommended to install in a region with at least
3 availability zones; having fewer than 3 availability zones would prevent the use of high-availability setups, if
needed in the future.
## Instance Setup
:::note
It is recommended to use a VPC with IPv6 enabled for future usage expansion.
:::
### Create Security Groups for the Server and Load Balancer
1. Create a security group for the load balancer.
   It is recommended to configure the security group to allow access, at first, only from a trusted IP address or a set
   of trusted IP addresses that will be used for the initial setup of the server.
   * Ingress TCP ports: 80, 443 from trusted IP addresses.
   * Egress: All addresses and ports.
1. Create a security group for the main server (`clearml-main`):
   * Ingress:
     * TCP port 10000, from the load balancer's security group
     * TCP port 22 from trusted IP addresses.
   * Egress: All addresses and ports
:::important
A company's security policy may require filtering egress traffic. Note, however, that during the initial setup
some external repositories are used to install software.
:::
### Create an IAM Role for the Server
To perform backups to S3, the instance needs a role that allows the EC2 instance read/write access to a backup bucket.
An example policy document with the above parameters is provided at `self_installed_policy.json`.
### Create Instance
Instance requirements:
1. The instance must be created in a VPC with at least two public subnets to allow for AWS load balancer setup.
2. `x86_64` based instance
3. [Amazon Linux 2 OS](https://aws.amazon.com/amazon-linux-2/?amazon-linux-whats-new.sort-by=item.additionalFields.postDateTime&amazon-linux-whats-new.sort-order=desc)
4. Disks:
   1. Root disk: 50GB `gp3` disk, or one with higher volume/performance.
   2. Data disk:
      1. Used for the databases (ElasticSearch and MongoDB) in which metadata and events are saved
      2. Device: `/dev/sdf`
      3. Recommended initial size: 100GB
      4. Type: `gp3`, or one with higher random-access performance.
   3. Fileserver disk:
      1. Used for storing files such as debug images and models
      2. Device: `/dev/sdg`
      3. Recommended initial size: should be estimated by the users of the system.
      4. Type: depending on usage, `gp3` or `st1` are usually the best options:
         1. For a large amount of data used by a small number of users/experiments, use `st1` (minimum `st1` disk size: 500GB).
         2. For all other scenarios, use SSD disks (e.g. `gp3`).
         3. The disk type can be changed after creation.
         4. A very large number of users and/or experiments may require higher than the default `gp3` disk performance.
   4. Docker data disk:
      1. Used for Docker data.
      2. Device: `/dev/sdh`
      3. Recommended initial size: 30GB
      4. Type: `gp3`
5. Use the `clearml-main` security group and the IAM role created in the previous step.
## Configuration and Software Deployment
### Install Dependencies
1. Copy the following files to the `/home/ec2-user` directory on the server:
   1. `envoy.yaml`
   2. `self_installed_VPC_EC2_amazon_linux_2_install.sh`
2. Run `self_installed_VPC_EC2_amazon_linux_2_install.sh` from the `/home/ec2-user` directory.
3. Verify the disks were mounted successfully (using `df -h`) to:
   1. `/opt/allegro/data`
   2. `/opt/allegro/data/fileserver`
   3. `/var/lib/docker`
4. Reboot the server.
### Version Deployment
1. Copy the following files to the `/home/ec2-user` directory on the server:
   * `constants.env`
   * `docker-compose.yml`
   * `docker-compose.override.yml`
2. Log in to Dockerhub:
```
source constants.env
sudo docker login -u=$DOCKERHUB_USER -p=$DOCKERHUB_PASSWORD
```
3. Start the dockers:
```
sudo docker-compose --env-file constants.env up -d
```
## Load Balancer
1. Create a TLS certificate:
   1. Choose a domain name to be used with the server. The main URL used by the system's users will be `app.<domain>`
   2. Create a certificate with the following DNS names:
      1. `<domain name>`
      2. `*.<domain name>`
2. Create the `envoy` target group for the server:
   1. Port: 10000
   2. Protocol: HTTP
   3. Target type: instance
   4. Attach the server instance as the single target.
   5. Health check:
      1. Match HTTP response code 200
      2. Path: `/api/debug.ping`
      3. Timeout: 10 seconds
      4. Healthy threshold: 1
      5. Unhealthy threshold: 2
3. Create an Application Load Balancer with the following parameters:
   1. Security group: as defined [above](#create-security-groups-for-the-server-and-load-balancer) for the load balancer
   2. Subnets: two subnets on the VPC. It is recommended to have at least one of the two on the same subnet as the instance.
   3. Idle timeout: 300 seconds
   4. Enable deletion protection: true
   5. IP address type: if possible, dualstack; otherwise, IPv4.
   6. Listeners:
      1. HTTP:
         1. Port: 80
         2. Protocol: HTTP
         3. Redirect (HTTP 301) to the same address, with HTTPS
      2. HTTPS:
         1. Port: 443
         2. Protocol: HTTPS
         3. Certificate: as defined above.
         4. SSL policy:
            1. Based on your company's security policy
            2. Currently recommended: `ELBSecurityPolicy-TLS13-1-2-Res-2021-06`
   :::note
   After setting up the listener, we recommend changing the automatically created rules: set a default HTTP 404
   response, and forward to the target group only if the HTTP Host header matches `<domain>` or `*.<domain>`
   :::
4. Define DNS rules:
   1. Use your DNS provider of choice to forward traffic to the load balancer.
   2. If using Route53, the use of A record aliases is recommended.
   3. The following domains should point to the load balancer:
      1. `<domain name>`
      2. `*.<domain name>`
You can now change the load balancer's security group to allow Internet access.
## Backups
### File Server
Identify the file server's EBS volume ID on the AWS console.
On the AWS backup service:
1. Create a backup vault.
2. Create a backup plan for the EBS volume into the vault.
   1. Perform at least a daily backup.
   2. Set backup expiration to at least 2 days.
### Elastic
#### Create the Backup Repo
1. Copy `create_elastic_backup_repo.sh` file to `/home/ec2-user` directory on the server
2. Run:
```
create_elastic_backup_repo.sh <bucket_name>
```
#### Backing Up
Backup is done by running the `elastic_backup.py` Python script periodically.
1. Copy `elastic_backup.py` to the `/home/ec2-user` directory on the server
2. Install the required packages:
```
pip3 install "elasticsearch>=6.0.0,<=7.17.7"
pip3 install boto3
```
3. For daily backups, run:
```
/home/ec2-user/elastic_backup.py --host-address localhost --snapshot-name-prefix clearml --backup-repo daily --delete-backups-older-than-days 7
```
4. For hourly backups, run:
```
/home/ec2-user/elastic_backup.py --host-address localhost --snapshot-name-prefix clearml --backup-repo hourly --delete-backups-older-than-days 1
```
5. It is recommended to add these to the crontab.
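For reference, a possible crontab layout combining the daily and hourly runs above (the schedules are illustrative and should match your retention policy):

```
# illustrative schedules: daily at 02:00, hourly on the hour
0 2 * * * /home/ec2-user/elastic_backup.py --host-address localhost --snapshot-name-prefix clearml --backup-repo daily --delete-backups-older-than-days 7
0 * * * * /home/ec2-user/elastic_backup.py --host-address localhost --snapshot-name-prefix clearml --backup-repo hourly --delete-backups-older-than-days 1
```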
### MongoDB
Backup is done by running the `mongo_backup.sh` script periodically.
1. Copy `mongo_backup.sh` to `/home/ec2-user` directory on the server
2. Run:
```
mongo_backup.sh <bucket name>/<prefix> (ex: mongo_backup.sh mybucket/path/in/bucket)
```
3. It is recommended to add this to the crontab.
:::note
The MongoDB script does not deal with deletion of old backups. It's recommended to create an S3 lifecycle rule for
deletion beyond the company's required retention period.
:::
## Monitoring
### Hardware Monitoring
#### CPU
CPU usage varies depending on system usage. We recommend monitoring CPU usage and alerting when it is higher than
normal. Recommended starting alerts are 5-minute CPU load levels of 5 and 10, adjusted according to observed performance.
#### RAM
Available memory also varies depending on system usage. Due to spikes in usage when performing certain tasks, 6-8 GB
of available RAM is recommended as the standard baseline; some use cases may require more. We therefore recommend having
8 GB of available memory on top of regular system usage. Alerts should trigger when available memory drops below normal.
#### Disk Usage
There are several disks used by the system. We recommend monitoring all of them. Standard alert levels are 20%, 10% and
5% of free disk space.
### Service Availability
The following services should be monitored periodically for availability and for response time:
* `apiserver` - [http://localhost:10000/api/debug.ping](http://localhost:10000/api/debug.ping) should return HTTP 200
* `webserver` - [http://localhost:10000](http://localhost:10000/) should return HTTP 200
* `fileserver` - [http://localhost:10000/files/](http://localhost:10000/files/) should return HTTP 405 ("method not allowed")
### API Server Docker Memory Usage
A usage spike can happen during normal operation, but very high spikes (above 6 GB) are not expected. We recommend using
`docker stats` to get this information.
For example, the following command retrieves the API server's information from the Docker daemon:
```
sudo curl -s --unix-socket /var/run/docker.sock http://localhost/containers/allegro-apiserver/stats?stream=false
```
We recommend monitoring the API server's memory in addition to the system's available RAM. Alerts should be triggered
when the API server's memory usage exceeds its normal behavior; a starting value can be 6 GB.
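The memory figure can be extracted from the stats payload returned by the call above; a sketch (the 6 GB threshold follows the suggested starting value, and the `memory_stats.usage` field is the standard Docker stats API field):

```python
import json

# starting alert threshold suggested above: 6 GB
MEMORY_ALERT_BYTES = 6 * 1024 ** 3

def memory_usage_bytes(stats):
    """Extract container memory usage (bytes) from a Docker stats payload."""
    return stats["memory_stats"]["usage"]

def should_alert(stats, threshold=MEMORY_ALERT_BYTES):
    """True if the container's memory usage exceeds the alert threshold."""
    return memory_usage_bytes(stats) > threshold

# example payload, trimmed to the relevant fields
sample = json.loads('{"memory_stats": {"usage": 2147483648, "limit": 16000000000}}')
print(should_alert(sample))  # a 2 GB reading is below the 6 GB threshold
```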
### Backup Failures
All provided scripts exit with code 0 when the backup completes successfully; any other exit code indicates a problem. The
log usually indicates the reason for the failure.
## Maintenance
### Removing App Containers
To remove old application containers, add the following to the cron:
```
0 0 * * * root docker container prune --force --filter "until=96h"
```
docs/deploying_models.md Normal file
@@ -0,0 +1,34 @@
---
title: Model Deployment
---
Model deployment makes trained models accessible for real-world applications. ClearML provides a comprehensive suite of
tools for seamless model deployment, with features including:
* Version control
* Automatic updates
* Performance monitoring
ClearML's offerings optimize the deployment process
while ensuring scalability and security. The solutions include:
* **Model Deployment UI Applications** (available under the Enterprise Plan) - The UI applications simplify deploying models
as network services through secure endpoints, providing an interface for managing deployments--no code required.
See more information about the following applications:
* [vLLM Deployment](webapp/applications/apps_model_deployment.md)
* [Embedding Model Deployment](webapp/applications/apps_embed_model_deployment.md)
* [Llama.cpp Model Deployment](webapp/applications/apps_llama_deployment.md)
* **Command-line Interface** - `clearml-serving` is a CLI for model deployment and orchestration.
It supports integration with Kubernetes clusters or custom container-based
solutions, offering flexibility for diverse infrastructure setups.
For more information, see [ClearML Serving](clearml_serving/clearml_serving.md).
## Model Endpoint Monitoring
All deployed models are displayed in a unified **Model Endpoints** list in the UI. This
allows users to monitor endpoint activity and manage deployments from a single location.
For more information, see [Model Endpoints](webapp/webapp_model_endpoints.md).
![Model Endpoints](img/webapp_model_endpoints_monitor.png#light-mode-only)
![Model Endpoints](img/webapp_model_endpoints_monitor_dark.png#dark-mode-only)
@@ -17,7 +17,7 @@ from installing required packages to setting environment variables,
all leading to executing the code (supporting both virtual environment or flexible docker container configurations).
The agent also supports overriding parameter values on-the-fly without code modification, thus enabling no-code experimentation (this is also the foundation on which
ClearML [Hyperparameter Optimization](hpo.md) is implemented).
ClearML [Hyperparameter Optimization](../getting_started/hpo.md) is implemented).
An agent can be associated with specific GPUs, enabling workload distribution. For example, on a machine with 8 GPUs you
can allocate several GPUs to an agent and use the rest for a different workload, even through another agent (see [Dynamic GPU Allocation](../clearml_agent/clearml_agent_dynamic_gpus.md)).
@@ -6,7 +6,7 @@ Hyperparameters are a script's configuration options. Since hyperparameters can
model performance, it is crucial to efficiently track and manage them.
ClearML supports tracking and managing hyperparameters in each task and provides a dedicated [hyperparameter
optimization module](hpo.md). With ClearML's logging and tracking capabilities, tasks can be reproduced, and their
optimization module](../getting_started/hpo.md). With ClearML's logging and tracking capabilities, tasks can be reproduced, and their
hyperparameters and results can be saved and compared, which is key to understanding model behavior.
ClearML lets you easily try out different hyperparameter values without changing your original code. ClearML's [execution
@@ -124,7 +124,7 @@ Available task types are:
* *inference* - Model inference job (e.g. offline / batch model execution)
* *controller* - A task that lays out the logic for other tasks' interactions, manual or automatic (e.g. a pipeline
controller)
* *optimizer* - A specific type of controller for optimization tasks (e.g. [hyperparameter optimization](hpo.md))
* *optimizer* - A specific type of controller for optimization tasks (e.g. [hyperparameter optimization](../getting_started/hpo.md))
* *service* - Long lasting or recurring service (e.g. server cleanup, auto ingress, sync services etc.)
* *monitor* - A specific type of service for monitoring
* *application* - A task implementing custom applicative logic, like [autoscaler](../guides/services/aws_autoscaler.md)
@@ -2,9 +2,9 @@
title: ClearML Modules
---
- [**ClearML Python Package**](../getting_started/ds/ds_first_steps.md#install-clearml) (`clearml`) for integrating ClearML into your existing code-base.
- [**ClearML Server**](../deploying_clearml/clearml_server.md) (`clearml-server`) for storing experiment, model, and workflow data, and supporting the Web UI experiment manager. It is also the control plane for the MLOps.
- [**ClearML Agent**](../clearml_agent.md) (`clearml-agent`), the MLOps orchestration agent. Enabling experiment and workflow reproducibility, and scalability.
- [**ClearML Python Package**](../clearml_sdk/clearml_sdk_setup.md) (`clearml`) for integrating ClearML into your existing code-base.
- [**ClearML Server**](../deploying_clearml/clearml_server.md) (`clearml-server`) for storing task, model, and workflow data, and supporting the Web UI experiment manager. It is also the control plane for the MLOps.
- [**ClearML Agent**](../clearml_agent.md) (`clearml-agent`), the MLOps orchestration agent. Enabling task and workflow reproducibility, and scalability.
- [**ClearML Data**](../clearml_data/clearml_data.md) (`clearml-data`) data management and versioning on top of file-systems/object-storage.
- [**ClearML Serving**](../clearml_serving/clearml_serving.md) (`clearml-serving`) for model deployment and orchestration.
- [**ClearML Session**](../apps/clearml_session.md) (`clearml-session`) for launching remote instances of Jupyter Notebooks and VSCode.
@@ -0,0 +1,59 @@
---
title: Auto-logging Experiments
---
In ClearML, experiments are organized as [Tasks](../fundamentals/task.md).
When you integrate the ClearML SDK with your code, the ClearML task manager automatically captures:
* Source code and uncommitted changes
* Installed packages
* General information such as machine details, runtime, creation date etc.
* Model files, parameters, scalars, and plots from popular ML frameworks such as TensorFlow and PyTorch (see list of [supported frameworks](../clearml_sdk/task_sdk.md#automatic-logging))
* Console output
:::tip Automatic logging control
To control what ClearML automatically logs, see this [FAQ](../faq.md#controlling_logging).
:::
## To Auto-log Your Experiments
1. Install `clearml` and connect it to the ClearML Server (see [instructions](../clearml_sdk/clearml_sdk.md))
1. At the beginning of your code, import the `clearml` package:
```python
from clearml import Task
```
:::tip Full Automatic Logging
To ensure full automatic logging, it is recommended to import the `clearml` package at the top of your entry script.
:::
1. Initialize the Task object in your `main()` function, or at the beginning of the script:
```python
task = Task.init(project_name='great project', task_name='best task')
```
If the project does not already exist, a new one is created automatically.
The console should display the following output:
```
ClearML Task: created new task id=1ca59ef1f86d44bd81cb517d529d9e5a
2021-07-25 13:59:09
ClearML results page: https://app.clear.ml/projects/4043a1657f374e9298649c6ba72ad233/experiments/1ca59ef1f86d44bd81cb517d529d9e5a/output/log
2025-01-25 13:59:16
```
1. Click the results page link to go to the [task's detail page in the ClearML WebApp](../webapp/webapp_exp_track_visual.md),
where you can monitor the task's status, view all its logged data, visualize its results, and more!
![Info panel](../img/webapp_tracking_40.png#light-mode-only)
![Info panel](../img/webapp_tracking_40_dark.png#dark-mode-only)
**That's it!** You are done integrating ClearML with your code :)
Now, [command-line arguments](../fundamentals/hyperparameters.md#tracking-hyperparameters), [console output](../fundamentals/logger.md#types-of-logged-results), TensorBoard and Matplotlib, and much more will automatically be
logged in the UI under the created Task.
Sit back, relax, and watch your models converge :)
@@ -0,0 +1,25 @@
---
title: Building Pipelines
---
Pipelines are a way to streamline and connect multiple processes, plugging the output of one process as the input of another.
ClearML Pipelines are implemented by a Controller Task that holds the logic of the pipeline steps' interactions. The
execution logic controls which step to launch based on parent steps completing their execution. Depending on the
specifications laid out in the controller task, a step's parameters can be overridden, enabling users to leverage other
steps' execution products such as artifacts and parameters.
When run, the controller will sequentially launch the pipeline steps. Pipelines can be executed locally or
on any machine using the [clearml-agent](../clearml_agent.md).
ClearML pipelines are created from code using one of the following:
* [PipelineController class](../pipelines/pipelines_sdk_tasks.md) - A pythonic interface for defining and configuring the
pipeline controller and its steps. The controller and steps can be functions in your Python code or existing ClearML tasks.
* [PipelineDecorator class](../pipelines/pipelines_sdk_function_decorators.md) - A set of Python decorators which transform
your functions into the pipeline controller and steps
For more information, see [ClearML Pipelines](../pipelines/pipelines.md).
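The launch-on-parents-complete logic described above can be illustrated with a framework-agnostic sketch (this is not the ClearML API, just the control flow a pipeline controller implements):

```python
def run_pipeline(steps, dependencies):
    """Run `steps` (name -> callable) so each step starts only after its parents.

    `dependencies` maps a step name to the list of its parent steps; each
    step receives its parents' outputs, mirroring how a controller can feed
    one step's execution products into the next.
    """
    done = {}
    pending = dict(steps)
    while pending:
        # launch every step whose parents have all completed
        ready = [name for name in pending
                 if all(parent in done for parent in dependencies.get(name, []))]
        if not ready:
            raise RuntimeError("cyclic or unsatisfiable dependencies")
        for name in ready:
            parent_outputs = [done[p] for p in dependencies.get(name, [])]
            done[name] = pending.pop(name)(*parent_outputs)
    return done

# a tiny three-step pipeline: load -> transform -> report
results = run_pipeline(
    steps={
        "load": lambda: [3, 1, 2],
        "transform": lambda data: sorted(data),
        "report": lambda data: f"min={data[0]} max={data[-1]}",
    },
    dependencies={"transform": ["load"], "report": ["transform"]},
)
print(results["report"])  # min=1 max=3
```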
![Pipeline DAG](../img/webapp_pipeline_DAG.png#light-mode-only)
![Pipeline DAG](../img/webapp_pipeline_DAG_dark.png#dark-mode-only)
@@ -0,0 +1,20 @@
---
title: Building Task Execution Environments in a Container
---
### Base Container
Build a container according to the execution environment of a specific task.
```bash
clearml-agent build --id <task-id> --docker --target <new-docker-name>
```
You can add the container as the base container image to a task, using one of the following methods:
- Using the **ClearML Web UI** - See [Default Container](../webapp/webapp_exp_tuning.md#default-container).
- In the ClearML configuration file - Use the ClearML configuration file [`agent.default_docker`](../configs/clearml_conf.md#agentdefault_docker)
options.
Check out [this tutorial](../guides/clearml_agent/exp_environment_containers.md) for building a Docker container
replicating the execution environment of an existing task.
@@ -1,5 +1,5 @@
---
title: Building Docker Containers
title: Building Executable Task Containers
---
## Exporting a Task into a Standalone Docker Container
@@ -28,20 +28,3 @@ Build a Docker container that when launched executes a specific task, or a clone
Check out [this tutorial](../guides/clearml_agent/executable_exp_containers.md) for building executable task
containers.
### Base Docker Container
Build a Docker container according to the execution environment of a specific task.
```bash
clearml-agent build --id <task-id> --docker --target <new-docker-name>
```
You can add the Docker container as the base Docker image to a task, using one of the following methods:
- Using the **ClearML Web UI** - See [Default Container](../webapp/webapp_exp_tuning.md#default-container).
- In the ClearML configuration file - Use the ClearML configuration file [`agent.default_docker`](../configs/clearml_conf.md#agentdefault_docker)
options.
Check out [this tutorial](../guides/clearml_agent/exp_environment_containers.md) for building a Docker container
replicating the execution environment of an existing task.
@@ -1,6 +1,7 @@
---
title: Scheduling Working Hours
title: Managing Agent Work Schedules
---
:::important Enterprise Feature
This feature is available under the ClearML Enterprise plan.
:::
@@ -0,0 +1,131 @@
---
title: Managing Your Data
---
Data is probably one of the biggest factors that determines the success of a project. Associating a model's data with
the model's configuration, code, and results (such as accuracy) is key to deducing meaningful insights into model behavior.
[ClearML Data](../clearml_data/clearml_data.md) lets you:
* Version your data
* Fetch your data from every machine with minimal code changes
* Use the data with any other task
* Associate data to task results.
ClearML offers the following data management solutions:
* `clearml.Dataset` - A Python interface for creating, retrieving, managing, and using datasets. See [SDK](../clearml_data/clearml_data_sdk.md)
for an overview of the basic methods of the Dataset module.
* `clearml-data` - A CLI utility for creating, uploading, and managing datasets. See [CLI](../clearml_data/clearml_data_cli.md)
for a reference of `clearml-data` commands.
* Hyper-Datasets - ClearML's advanced queryable dataset management solution. For more information, see [Hyper-Datasets](../hyperdatasets/overview.md)
The following guide will use both the `clearml-data` CLI and the `Dataset` class to do the following:
1. Create a ClearML dataset
2. Access the dataset from a ClearML Task in order to preprocess the data
3. Create a new version of the dataset with the modified data
4. Use the new version of the dataset to train a model
## Creating a Dataset
Let's assume you have some code that extracts data from a production database into a local folder.
Your goal is to create an immutable copy of the data to be used by further steps.
1. Create the dataset using the `clearml-data create` command and passing the dataset's project and name. You can add a
`latest` tag, making it easier to find it later.
```bash
clearml-data create --project chatbot_data --name dataset_v1 --latest
```
1. Add data to the dataset using `clearml-data sync` and passing the path of the folder to be added to the dataset.
This command also uploads the data and finalizes the dataset automatically.
```bash
clearml-data sync --folder ./work_dataset
```
## Preprocessing Data
The second step is to preprocess the data. First access the data, then modify it,
and lastly create a new version of the data.
1. Create a task for your data preprocessing (optional):
```python
from clearml import Task, Dataset
# create a task for the data processing
task = Task.init(project_name='data', task_name='create', task_type='data_processing')
```
1. Access a dataset using [`Dataset.get()`](../references/sdk/dataset.md#datasetget):
```python
# get the v1 dataset
dataset = Dataset.get(dataset_project='data', dataset_name='dataset_v1')
```
1. Get a local mutable copy of the dataset using [`Dataset.get_mutable_local_copy`](../references/sdk/dataset.md#get_mutable_local_copy). \
This downloads the dataset to a specified `target_folder` (non-cached). If the folder already has contents, specify
whether to overwrite its contents with the dataset contents using the `overwrite` parameter.
```python
# get a local mutable copy of the dataset
dataset_folder = dataset.get_mutable_local_copy(
target_folder='work_dataset',
overwrite=True
)
```
1. Preprocess the data, including modifying some files in the `./work_dataset` folder.
1. Create a new version of the dataset:
```python
# create a new version of the dataset with the pickle file
new_dataset = Dataset.create(
dataset_project='data',
dataset_name='dataset_v2',
parent_datasets=[dataset],
# this will make sure we have the creation code and the actual dataset artifacts on the same Task
use_current_task=True,
   )
   ```
1. Add the modified data to the dataset:
```python
new_dataset.sync_folder(local_path=dataset_folder)
new_dataset.upload()
new_dataset.finalize()
```
1. Remove the `latest` tag from the previous dataset and add the tag to the new dataset:
```python
# now let's remove the previous dataset tag
dataset.tags = []
new_dataset.tags = ['latest']
```
The new dataset inherits the contents of the datasets specified in `Dataset.create`'s `parent_datasets` argument.
This not only helps trace back dataset changes with full genealogy, but also makes the storage more efficient,
since it only stores the changed and/or added files from the parent versions.
When you access the dataset, the files from all parent versions are merged in a fully transparent process,
as if the files were always part of the requested Dataset.
## Training
You can now train your model with the **latest** dataset you have in the system, by getting the instance of the Dataset
based on the `latest` tag (if you have two Datasets with the same tag you will get the newest).
Once you have the dataset you can request a local copy of the data. All local copy requests are cached,
which means that if you access the same dataset multiple times you will not have any unnecessary downloads.
```python
# create a task for the model training
task = Task.init(project_name='data', task_name='ingest', task_type='training')
# get the latest dataset with the tag `latest`
dataset = Dataset.get(dataset_tags='latest')
# get a cached copy of the Dataset files
dataset_folder = dataset.get_local_copy()
# train model here
```
@@ -1,193 +0,0 @@
---
title: Next Steps
---
So, you've already [installed ClearML's Python package](ds_first_steps.md) and run your first experiment!
Now, you'll learn how to track Hyperparameters, Artifacts, and Metrics!
## Accessing Experiments
Every previously executed experiment is stored as a Task.
A Task's project and name can be changed after the experiment has been executed.
A Task is also automatically assigned an auto-generated unique identifier (UUID string) that cannot be changed and always locates the same Task in the system.
Retrieve a Task object programmatically by querying the system based on either the Task ID,
or project and name combination. You can also query tasks based on their properties, like tags (see [Querying Tasks](../../clearml_sdk/task_sdk.md#querying--searching-tasks)).
```python
prev_task = Task.get_task(task_id='123456deadbeef')
```
Once you have a Task object you can query the state of the Task, get its model(s), scalars, parameters, etc.
## Log Hyperparameters
For full reproducibility, it's paramount to save hyperparameters for each experiment. Since hyperparameters can have a substantial impact
on model performance, saving and comparing these between experiments is sometimes the key to understanding model behavior.
ClearML supports logging `argparse` module arguments out of the box, so once ClearML is integrated into the code, it automatically logs all parameters provided to the argument parser.
You can also log parameter dictionaries (very useful when parsing an external configuration file and storing as a dict object),
whole configuration files, or even custom objects or [Hydra](https://hydra.cc/docs/intro/) configurations!
```python
params_dictionary = {'epochs': 3, 'lr': 0.4}
task.connect(params_dictionary)
```
See [Configuration](../../clearml_sdk/task_sdk.md#configuration) for all hyperparameter logging options.
## Log Artifacts
ClearML lets you easily store the output products of an experiment - Model snapshot / weights file, a preprocessing of your data, feature representation of data and more!
Essentially, artifacts are files (or Python objects) uploaded from a script and are stored alongside the Task.
These artifacts can be easily accessed by the web UI or programmatically.
Artifacts can be stored anywhere, either on the ClearML server, or any object storage solution or shared folder.
See all [storage capabilities](../../integrations/storage.md).
### Adding Artifacts
Upload a local file containing the preprocessed results of the data:
```python
task.upload_artifact(name='data', artifact_object='/path/to/preprocess_data.csv')
```
You can also upload an entire folder with all its content by passing the folder (the folder will be zipped and uploaded as a single zip file).
```python
task.upload_artifact(name='folder', artifact_object='/path/to/folder/')
```
Lastly, you can upload an instance of an object; Numpy/Pandas/PIL Images are supported with `npz`/`csv.gz`/`jpg` formats respectively.
If the object type is unknown, ClearML pickles it and uploads the pickle file.
```python
numpy_object = np.eye(100, 100)
task.upload_artifact(name='features', artifact_object=numpy_object)
```
For more artifact logging options, see [Artifacts](../../clearml_sdk/task_sdk.md#artifacts).
### Using Artifacts
Logged artifacts can be used by other Tasks, whether it's a pre-trained Model or processed data.
To use an artifact, first you have to get an instance of the Task that originally created it,
then you either download it and get its path, or get the artifact object directly.
For example, using a previously generated preprocessed data.
```python
preprocess_task = Task.get_task(task_id='preprocessing_task_id')
local_csv = preprocess_task.artifacts['data'].get_local_copy()
```
`task.artifacts` is a dictionary where the keys are the artifact names, and the returned object is the artifact object.
Calling `get_local_copy()` returns a local cached copy of the artifact. Therefore, next time you execute the code, you don't
need to download the artifact again.
Calling `get()` gets a deserialized pickled object.
Check out the [artifacts retrieval](https://github.com/clearml/clearml/blob/master/examples/reporting/artifacts_retrieval.py) example code.
### Models
Models are a special kind of artifact.
Models created by popular frameworks (such as PyTorch, TensorFlow, Scikit-learn) are automatically logged by ClearML.
All snapshots are automatically logged. To ensure the model snapshot itself is uploaded (rather than just its local path being saved),
pass a storage location to which the model files should be uploaded.
For example, upload all snapshots to an S3 bucket:
```python
task = Task.init(
project_name='examples',
task_name='storing model',
output_uri='s3://my_models/'
)
```
Now, whenever the framework (TensorFlow/Keras/PyTorch etc.) stores a snapshot, the model file is automatically uploaded to the bucket to a specific folder for the experiment.
Loading models by a framework is also logged by the system; these models appear in an experiment's **Artifacts** tab,
under the "Input Models" section.
Check out model snapshots examples for [TensorFlow](https://github.com/clearml/clearml/blob/master/examples/frameworks/tensorflow/tensorflow_mnist.py),
[PyTorch](https://github.com/clearml/clearml/blob/master/examples/frameworks/pytorch/pytorch_mnist.py),
[Keras](https://github.com/clearml/clearml/blob/master/examples/frameworks/keras/keras_tensorboard.py),
[scikit-learn](https://github.com/clearml/clearml/blob/master/examples/frameworks/scikit-learn/sklearn_joblib_example.py).
#### Loading Models
Loading a previously trained model is quite similar to loading artifacts.
```python
prev_task = Task.get_task(task_id='the_training_task')
last_snapshot = prev_task.models['output'][-1]
local_weights_path = last_snapshot.get_local_copy()
```
Like before, you have to get the instance of the task that trained the original weights file; then you can query the task for its output models (a list of snapshots) and get the latest snapshot.
:::note
Using TensorFlow, the snapshots are stored in a folder, meaning the `local_weights_path` will point to a folder containing your requested snapshot.
:::
As with artifacts, all models are cached, meaning the next time you run this code, no model needs to be downloaded.
Once one of the frameworks loads the weights file, the running task is automatically updated, with its "Input Model" pointing directly to the original training Task's Model.
This feature lets you easily trace the full genealogy of every model trained and used by your system!
## Log Metrics
Full metrics logging is the key to finding the best performing model!
By default, ClearML automatically captures and logs everything reported to TensorBoard and Matplotlib.
Since not all metrics are tracked that way, you can also manually report metrics using a [`Logger`](../../fundamentals/logger.md) object.
You can log everything, from time series data and confusion matrices to HTML, Audio, and Video, to custom plotly graphs! Everything goes!
![Experiment plots](../../img/report_plotly.png#light-mode-only)
![Experiment plots](../../img/report_plotly_dark.png#dark-mode-only)
Once everything is neatly logged and displayed, use the [comparison tool](../../webapp/webapp_exp_comparing.md) to find the best configuration!
## Track Experiments
The task table is a powerful tool for creating dashboards and views of your own projects, your team's projects, or the entire development.
![Task table](../../img/webapp_experiment_table.png#light-mode-only)
![Task table](../../img/webapp_experiment_table_dark.png#dark-mode-only)
### Creating Leaderboards
Customize the [task table](../../webapp/webapp_exp_table.md) to fit your own needs, adding desired views of parameters, metrics, and tags.
You can filter and sort based on parameters and metrics, so creating custom views is simple and flexible.
Create a dashboard for a project, presenting the latest Models and their accuracy scores, for immediate insights.
It can also be used as a live leaderboard, showing the best performing experiments' status, updated in real time.
This is helpful to monitor your projects' progress, and to share it across the organization.
Any page is sharable by copying the URL from the address bar, allowing you to bookmark leaderboards or to send an exact view of a specific experiment or a comparison page.
You can also tag Tasks for visibility and filtering, allowing you to add more information about the experiment's execution.
Later you can search based on task name in the search bar, and filter experiments based on their tags, parameters, status, and more.
## What's Next?
This covers the basics of ClearML! Running through this guide you've learned how to log Parameters, Artifacts and Metrics!
If you want to learn more, look at how we see the data science process on our [best practices](best_practices.md) page,
or check these pages out:
- Scale your work and deploy [ClearML Agents](../../clearml_agent.md)
- Develop on remote machines with [ClearML Session](../../apps/clearml_session.md)
- Structure your work and put it into [Pipelines](../../pipelines/pipelines.md)
- Improve your experiments with [Hyperparameter Optimization](../../fundamentals/hpo.md)
- Check out ClearML's integrations with your favorite ML frameworks like [TensorFlow](../../integrations/tensorflow.md),
[PyTorch](../../integrations/pytorch.md), [Keras](../../integrations/keras.md),
and more
## YouTube Playlist
All these tips and tricks are also covered in ClearML's **Getting Started** series on YouTube. Go check it out :)
[![Watch the video](https://img.youtube.com/vi/kyOfwVg05EM/hqdefault.jpg)](https://www.youtube.com/watch?v=kyOfwVg05EM&list=PLMdIlCuMqSTnoC45ME5_JnsJX0zWqDdlO&index=3)
@@ -0,0 +1,34 @@
---
title: Hyperparameter Optimization
---
## What is Hyperparameter Optimization?
Hyperparameters are variables that directly control the behaviors of training algorithms, and have a significant effect on
the performance of the resulting machine learning models. Hyperparameter optimization (HPO) is crucial for improving
model performance and generalization.
Finding the hyperparameter values that yield the best performing models can be complicated. Manually adjusting
hyperparameters over the course of many training trials can be slow and tedious. Luckily, ClearML offers automated
solutions to boost hyperparameter optimization efficiency.
## Workflow
![Hyperparameter optimization diagram](../img/hpo_diagram.png)
The preceding diagram demonstrates the typical flow of hyperparameter optimization where the parameters of a base task are optimized:
1. Configure an Optimization Task with a base task whose parameters will be optimized, optimization targets, and a set of parameter values to
test
1. Clone the base task. Each clone's parameters are overridden with values from the optimization task
1. Enqueue each clone for execution by a ClearML Agent
1. The Optimization Task records and monitors the cloned tasks' configuration and execution details, and returns a
summary of the optimization results.
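The four steps above can be sketched generically (a toy, framework-agnostic illustration, not the ClearML API; the base task here is just a function, and the "clones" are runs with overridden parameters):

```python
def base_task(params):
    """A stand-in training task: returns a score for a parameter setting."""
    lr = params["lr"]
    return 1.0 - (lr - 0.1) ** 2  # best score at lr == 0.1

def optimize(base_params, overrides, objective_sign="max"):
    """Clone the base parameters with each override, run, and record results."""
    results = []
    for value in overrides:
        clone = dict(base_params, lr=value)   # step 2: clone + override
        score = base_task(clone)              # step 3: execute the clone
        results.append((clone, score))        # step 4: record the result
    best = max if objective_sign == "max" else min
    return best(results, key=lambda r: r[1])

best_params, best_score = optimize({"lr": 0.5, "epochs": 3}, [0.01, 0.1, 0.3])
print(best_params["lr"])  # 0.1 yields the highest score
```

In the real workflow, each "clone" is a full ClearML task executed by an agent, and the optimizer searches the parameter space with a configurable strategy rather than a fixed list.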
## ClearML Solutions
ClearML offers three solutions for hyperparameter optimization:
* [GUI application](../webapp/applications/apps_hpo.md): The Hyperparameter Optimization app allows you to run and manage the optimization tasks
directly from the web interface--no code necessary (available under the ClearML Pro plan).
* [Command-Line Interface (CLI)](../apps/clearml_param_search.md): The `clearml-param-search` CLI tool enables you to configure and launch the optimization process from your terminal.
* [Python Interface](../clearml_sdk/hpo_sdk.md): The `HyperParameterOptimizer` class within the ClearML SDK allows you to
configure and launch optimization tasks, and seamlessly integrate them in your existing model training tasks.
@@ -0,0 +1,122 @@
---
title: Logging and Using Task Artifacts
---
:::note
This tutorial assumes that you've already set up [ClearML](../clearml_sdk/clearml_sdk_setup.md)
:::
ClearML lets you easily store a task's output products--or **Artifacts**:
* [Model](#models) snapshot / weights file
* Preprocessing of your data
* Feature representation of data
* And more!
**Artifacts** are files or Python objects that are uploaded and stored alongside the Task.
These artifacts can be easily accessed by the web UI or programmatically.
Artifacts can be stored anywhere, either on the ClearML Server, or any object storage solution or shared folder.
See all [storage capabilities](../integrations/storage.md).
## Adding Artifacts
Let's create a [Task](../fundamentals/task.md) and add some artifacts to it.
1. Create a task using [`Task.init()`](../references/sdk/task.md#taskinit)
```python
from clearml import Task
task = Task.init(project_name='great project', task_name='task with artifacts')
```
1. Upload a local **file** using [`Task.upload_artifact()`](../references/sdk/task.md#upload_artifact) and specifying the artifact's
   name and its path:
```python
task.upload_artifact(name='data', artifact_object='/path/to/preprocess_data.csv')
```
1. Upload an **entire folder** with all its content by passing the folder path (the folder will be zipped and uploaded as a single zip file).
```python
task.upload_artifact(name='folder', artifact_object='/path/to/folder/')
```
1. Upload an instance of an object. Numpy/Pandas/PIL Images are supported with `npz`/`csv.gz`/`jpg` formats respectively.
   If the object type is unknown, ClearML pickles it and uploads the pickle file.
```python
numpy_object = np.eye(100, 100)
task.upload_artifact(name='features', artifact_object=numpy_object)
```
For more artifact logging options, see [Artifacts](../clearml_sdk/task_sdk.md#artifacts).
### Using Artifacts
Logged artifacts can be used by other Tasks, whether it's a pre-trained Model or processed data.
To use an artifact, first you have to get an instance of the Task that originally created it,
then you either download it and get its path, or get the artifact object directly.
For example, using a previously generated preprocessed data.
```python
preprocess_task = Task.get_task(task_id='preprocessing_task_id')
local_csv = preprocess_task.artifacts['data'].get_local_copy()
```
`task.artifacts` is a dictionary where the keys are the artifact names, and the returned object is the artifact object.
Calling `get_local_copy()` returns a local cached copy of the artifact. Therefore, the next time you execute the code, you don't
need to download the artifact again.
Calling `get()` gets a deserialized pickled object.
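For example, the `features` artifact uploaded above can be pulled back as a live object (a sketch; `preprocessing_task_id` is a placeholder for your own task ID):

```python
from clearml import Task

# fetch the task that uploaded the artifact
preprocess_task = Task.get_task(task_id='preprocessing_task_id')

# deserialize the artifact straight into memory - for the numpy example above,
# this returns the original array, with no manual file handling needed
features = preprocess_task.artifacts['features'].get()
print(features.shape)
```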
Check out the [artifacts retrieval](https://github.com/clearml/clearml/blob/master/examples/reporting/artifacts_retrieval.py) example code.
## Models
Models are a special kind of artifact.
Models created by popular frameworks (such as PyTorch, TensorFlow, Scikit-learn) are automatically logged by ClearML.
All snapshots are automatically logged. To make sure the model snapshot itself is uploaded (instead of only its local path being saved),
pass a storage location to which the model files should be uploaded.
For example, upload all snapshots to an S3 bucket:
```python
task = Task.init(
project_name='examples',
task_name='storing model',
output_uri='s3://my_models/'
)
```
Now, whenever the framework (TensorFlow/Keras/PyTorch etc.) stores a snapshot, the model file is automatically uploaded to a task-specific folder in the bucket.
Loading models by a framework is also logged by the system; these models appear in a task's **Artifacts** tab,
under the "Input Models" section.
Check out model snapshots examples for [TensorFlow](https://github.com/clearml/clearml/blob/master/examples/frameworks/tensorflow/tensorflow_mnist.py),
[PyTorch](https://github.com/clearml/clearml/blob/master/examples/frameworks/pytorch/pytorch_mnist.py),
[Keras](https://github.com/clearml/clearml/blob/master/examples/frameworks/keras/keras_tensorboard.py),
[scikit-learn](https://github.com/clearml/clearml/blob/master/examples/frameworks/scikit-learn/sklearn_joblib_example.py).
### Loading Models
Loading a previously trained model is quite similar to loading artifacts.
```python
prev_task = Task.get_task(task_id='the_training_task')
last_snapshot = prev_task.models['output'][-1]
local_weights_path = last_snapshot.get_local_copy()
```
As before, you first get the instance of the task that trained the original weights, then query the task for its output models (a list of snapshots), and take the latest snapshot.
:::note
With TensorFlow, snapshots are stored in a folder, so `local_weights_path` will point to a folder containing your requested snapshot.
:::
As with artifacts, all models are cached, meaning the next time you run this code, no model needs to be downloaded.
Once one of the frameworks loads the weights file, the running task is automatically updated, with its "Input Model" pointing directly to the original training Task's model.
This makes it easy to get a full genealogy of every model trained and used by your system!

View File

@@ -1,8 +1,4 @@
---
id: main
title: What is ClearML?
slug: /
---
# What is ClearML?
ClearML is an open-source, end-to-end AI Platform designed to streamline AI adoption and the entire development lifecycle.
It supports every phase of AI development, from research to production, allowing users to
@@ -109,14 +105,14 @@ Want a more in depth introduction to ClearML? Choose where you want to get start
- [Track and upload](../fundamentals/task.md) metrics and models with only 2 lines of code
- [Reproduce](../webapp/webapp_exp_reproducing.md) tasks with 3 mouse clicks
- [Create bots](../guides/services/slack_alerts.md) that send you Slack messages based on experiment behavior (for example,
- [Create bots](../guides/services/slack_alerts.md) that send you Slack messages based on task behavior (for example,
alert you whenever your model improves in accuracy)
- Manage your [data](../clearml_data/clearml_data.md) - store, track, and version control
- Remotely execute experiments on any compute resource you have available with [ClearML Agent](../clearml_agent.md)
- Remotely execute tasks on any compute resource you have available with [ClearML Agent](../clearml_agent.md)
- Automatically scale cloud instances according to your resource needs with ClearML's
[AWS Autoscaler](../webapp/applications/apps_aws_autoscaler.md) and [GCP Autoscaler](../webapp/applications/apps_gcp_autoscaler.md)
GUI applications
- Run [hyperparameter optimization](../fundamentals/hpo.md)
- Run [hyperparameter optimization](hpo.md)
- Build [pipelines](../pipelines/pipelines.md) from code
- Much more!

View File

@@ -1,225 +0,0 @@
---
title: First Steps
---
:::note
This tutorial assumes that you've already [signed up](https://app.clear.ml) to ClearML
:::
ClearML provides tools for **automation**, **orchestration**, and **tracking**, all key in performing effective MLOps and LLMOps.
Effective MLOps and LLMOps rely on the ability to scale work beyond one's own computer. Moving from your own machine can be time-consuming.
Even assuming that you have all the drivers and applications installed, you still need to manage multiple Python environments
for different packages / package versions, or worse - manage different Dockers for different package versions.
Not to mention, when working on remote machines, executing experiments, tracking what's running where, and making sure machines
are fully utilized at all times become daunting tasks.
This can create overhead that derails you from your core work!
ClearML Agent was designed to deal with such issues and more! It is a tool responsible for executing tasks on remote machines: on-premises or in the cloud! ClearML Agent provides the means to reproduce and track tasks in your
machine of choice through the ClearML WebApp with no need for additional code.
The agent will set up the environment for a specific Task's execution (inside a Docker, or bare-metal), install the
required Python packages, and execute and monitor the process.
## Set up an Agent
1. Install the agent:
```bash
pip install clearml-agent
```
1. Connect the agent to the server by [creating credentials](https://app.clear.ml/settings/workspace-configuration), then run this:
```bash
clearml-agent init
```
:::note
If you've already created credentials, you can copy-paste the default agent section from [here](https://github.com/clearml/clearml-agent/blob/master/docs/clearml.conf#L15) (this is optional. If the section is not provided the default values will be used)
:::
1. Start the agent's daemon and assign it to a [queue](../../fundamentals/agents_and_queues.md#what-is-a-queue):
```bash
clearml-agent daemon --queue default
```
A queue is an ordered list of Tasks that are scheduled for execution. The agent will pull Tasks from its assigned
queue (`default` in this case), and execute them one after the other. Multiple agents can listen to the same queue
(or even multiple queues), but only a single agent will pull a Task to be executed.
:::tip Agent Deployment Modes
ClearML Agents can be deployed in:
* [Virtual environment mode](../../clearml_agent/clearml_agent_execution_env.md): Agent creates a new venv to execute a task.
* [Docker mode](../../clearml_agent/clearml_agent_execution_env.md#docker-mode): Agent executes a task inside a
Docker container.
For more information, see [Running Modes](../../fundamentals/agents_and_queues.md#running-modes).
:::
## Clone a Task
Tasks can be reproduced (cloned) for validation or as a baseline for further experimentation.
Cloning a task duplicates the task's configuration, but not its outputs.
**To clone a task in the ClearML WebApp:**
1. Click on any project card to open its [task table](../../webapp/webapp_exp_table.md).
1. Right-click one of the tasks on the table.
1. Click **Clone** in the context menu, which will open a **CLONE TASK** window.
1. Click **CLONE** in the window.
The newly cloned task will appear and its info panel will slide open. The cloned task is in draft mode, so
it can be modified. You can edit the Git / code references, control the Python packages to be installed, specify the
Docker container image to be used, or change the hyperparameters and configuration files. See [Modifying Tasks](../../webapp/webapp_exp_tuning.md#modifying-tasks) for more information about editing tasks in the UI.
## Enqueue a Task
Once you have set up a task, it is now time to execute it.
**To execute a task through the ClearML WebApp:**
1. Right-click your draft task (the context menu is also available through the <img src="/docs/latest/icons/ico-bars-menu.svg" alt="Menu" className="icon size-md space-sm" />
button on the top right of the task's info panel)
1. Click **ENQUEUE,** which will open the **ENQUEUE TASK** window
1. In the window, select `default` in the queue menu
1. Click **ENQUEUE**
This action pushes the task into the `default` queue. The task's status becomes *Pending* until an agent
assigned to the queue fetches it, at which time the task's status becomes *Running*. The agent executes the
task, and the task can be [tracked and its results visualized](../../webapp/webapp_exp_track_visual.md).
## Programmatic Interface
The cloning, modifying, and enqueuing actions described above can also be performed programmatically.
### First Steps
#### Access Previously Executed Tasks
All Tasks in the system can be accessed through their unique Task ID, or based on their properties using the [`Task.get_task`](../../references/sdk/task.md#taskget_task)
method. For example:
```python
from clearml import Task
executed_task = Task.get_task(task_id='aabbcc')
```
Once a specific Task object has been obtained, it can be cloned, modified, and more. See [Advanced Usage](#advanced-usage).
#### Clone a Task
To duplicate a task, use the [`Task.clone`](../../references/sdk/task.md#taskclone) method, and input either a
Task object or the Task's ID as the `source_task` argument.
```python
cloned_task = Task.clone(source_task=executed_task)
```
#### Enqueue a Task
To enqueue the task, use the [`Task.enqueue`](../../references/sdk/task.md#taskenqueue) method, and input the Task object
with the `task` argument, and the queue to push the task into with `queue_name`.
```python
Task.enqueue(task=cloned_task, queue_name='default')
```
### Advanced Usage
Before execution, use a variety of programmatic methods to manipulate a task object.
#### Modify Hyperparameters
[Hyperparameters](../../fundamentals/hyperparameters.md) are an integral part of Machine Learning code as they let you
control the code without directly modifying it. Hyperparameters can be added from anywhere in your code, and ClearML supports multiple ways to obtain them!
Users can programmatically change cloned tasks' parameters.
For example:
```python
from clearml import Task
cloned_task = Task.clone(task_id='aabbcc')
cloned_task.set_parameter(name='internal/magic', value=42)
```
#### Report Artifacts
Artifacts are files created by your task. Users can upload [multiple types of data](../../clearml_sdk/task_sdk.md#logging-artifacts),
objects, and files to a task from anywhere in the code.
```python
import numpy as np
from clearml import Task
Task.current_task().upload_artifact(name='a_file', artifact_object='local_file.bin')
Task.current_task().upload_artifact(name='numpy', artifact_object=np.ones((4, 4)))
```
Artifacts serve as a great way to pass and reuse data between tasks. Artifacts can be [retrieved](../../clearml_sdk/task_sdk.md#using-artifacts)
by accessing the Task that created them. These artifacts can be modified and uploaded to other tasks.
```python
from clearml import Task
executed_task = Task.get_task(task_id='aabbcc')
# artifact as a file
local_file = executed_task.artifacts['file'].get_local_copy()
# artifact as object
a_numpy = executed_task.artifacts['numpy'].get()
```
By facilitating the communication of complex objects between tasks, artifacts serve as the foundation of ClearML's [Data Management](../../clearml_data/clearml_data.md)
and [pipeline](../../pipelines/pipelines.md) solutions.
#### Log Models
Logging models into the model repository is the easiest way to integrate the development process directly with production.
Any model stored by a supported framework (Keras / TensorFlow / PyTorch / Joblib etc.) will be automatically logged into ClearML.
ClearML also supports methods to explicitly log models. Models can be automatically stored on a preferred storage medium
(S3 bucket, Google storage, etc.).
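As a sketch of explicit model logging using the `OutputModel` class (the file name and framework are illustrative placeholders):

```python
from clearml import Task, OutputModel

task = Task.init(project_name='examples', task_name='manual model logging')

# explicitly register a locally saved weights file as the task's output model;
# if the task has an output_uri set, the file is also uploaded to that storage target
output_model = OutputModel(task=task, framework='PyTorch')
output_model.update_weights(weights_filename='my_model.pt')
```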
#### Log Metrics
Log as many metrics as you want from your processes using the [Logger](../../fundamentals/logger.md) module. This
improves the visibility of your processes' progress.
```python
from clearml import Logger
Logger.current_logger().report_scalar(
graph='metric',
series='variant',
value=13.37,
iteration=counter
)
```
You can also retrieve reported scalars for programmatic analysis:
```python
from clearml import Task
executed_task = Task.get_task(task_id='aabbcc')
# get a summary of the min/max/last value of all reported scalars
min_max_values = executed_task.get_last_scalar_metrics()
# get detailed graphs of all scalars
full_scalars = executed_task.get_reported_scalars()
```
#### Query Tasks
You can also search and query Tasks in the system. Use the [`Task.get_tasks`](../../references/sdk/task.md#taskget_tasks)
class method to retrieve Task objects and filter based on the specific values of the Task - status, parameters, metrics and more!
```python
from clearml import Task
tasks = Task.get_tasks(
project_name='examples',
task_name='partial_name_match',
task_filter={'status': 'in_progress'}
)
```
#### Manage Your Data
Data is probably one of the biggest factors that determines the success of a project. Associating a model's data with
the model's configuration, code, and results (such as accuracy) is key to deducing meaningful insights into model behavior.
[ClearML Data](../../clearml_data/clearml_data.md) lets you version your data, so it's never lost, fetch it from every
machine with minimal code changes, and associate data to task results.
Logging data can be done via command line, or programmatically. If any preprocessing code is involved, ClearML logs it
as well! Once data is logged, it can be used by other tasks.
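A minimal programmatic sketch of logging a dataset (the folder path is a placeholder):

```python
from clearml import Dataset

# create a new dataset version and attach local files to it
dataset = Dataset.create(dataset_project='data', dataset_name='raw_data')
dataset.add_files(path='/path/to/data_folder')

# upload the files to storage and close the version so other tasks can use it
dataset.upload()
dataset.finalize()
```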

View File

@@ -1,121 +0,0 @@
---
title: Next Steps
---
Once Tasks are defined and in the ClearML system, they can be chained together to create Pipelines.
Pipelines provide users with a greater level of abstraction and automation, with Tasks running one after the other.
Tasks can interface with other Tasks in the pipeline and leverage other Tasks' work products.
The sections below describe the following scenarios:
* [Dataset creation](#dataset-creation)
* Data [processing](#preprocessing-data) and [consumption](#training)
* [Pipeline building](#building-the-pipeline)
## Building Tasks
### Dataset Creation
Let's assume you have some code that extracts data from a production database into a local folder.
Your goal is to create an immutable copy of the data to be used by further steps:
```bash
clearml-data create --project data --name dataset
clearml-data sync --folder ./from_production
```
You can add a tag `latest` to the Dataset, marking it as the latest version.
### Preprocessing Data
The second step is to preprocess the data. First access the data, then modify it,
and lastly create a new version of the data.
```python
from clearml import Task, Dataset
# create a task for the data processing part
task = Task.init(project_name='data', task_name='create', task_type='data_processing')
# get the v1 dataset
dataset = Dataset.get(dataset_project='data', dataset_name='dataset_v1')
# get a local mutable copy of the dataset
dataset_folder = dataset.get_mutable_local_copy(
target_folder='work_dataset',
overwrite=True
)
# change some files in the `./work_dataset` folder
# create a new version of the dataset with the modified files
new_dataset = Dataset.create(
dataset_project='data',
dataset_name='dataset_v2',
parent_datasets=[dataset],
# this will make sure we have the creation code and the actual dataset artifacts on the same Task
use_current_task=True,
)
new_dataset.sync_folder(local_path=dataset_folder)
new_dataset.upload()
new_dataset.finalize()
# now let's remove the previous dataset tag
dataset.tags = []
new_dataset.tags = ['latest']
```
The new dataset inherits the contents of the datasets specified in `Dataset.create`'s `parent_datasets` argument.
This not only helps trace back dataset changes with full genealogy, but also makes the storage more efficient,
since it only stores the changed and/or added files from the parent versions.
When you access the Dataset, it automatically merges the files from all parent versions
in a fully automatic and transparent process, as if the files were always part of the requested Dataset.
### Training
You can now train your model with the **latest** Dataset you have in the system, by getting the instance of the Dataset
based on the `latest` tag
(if you happen to have two Datasets with the same tag, you will get the newest one).
Once you have the dataset you can request a local copy of the data. All local copy requests are cached,
which means that if you access the same dataset multiple times you will not have any unnecessary downloads.
```python
# create a task for the model training
task = Task.init(project_name='data', task_name='ingest', task_type='training')
# get the latest dataset with the tag `latest`
dataset = Dataset.get(dataset_tags='latest')
# get a cached copy of the Dataset files
dataset_folder = dataset.get_local_copy()
# train our model here
```
## Building the Pipeline
Now that you have the data creation step and the data training step, create a pipeline that, when executed,
runs the first step and then the second.
It is important to remember that pipelines are Tasks by themselves and can also be automated by other pipelines (i.e. pipelines of pipelines).
```python
from clearml import PipelineController
pipe = PipelineController(
project='data',
name='pipeline demo',
version="1.0"
)
pipe.add_step(
    name='step 1 data',
    base_task_project='data',
    base_task_name='create'
)
pipe.add_step(
    name='step 2 train',
    parents=['step 1 data', ],
    base_task_project='data',
    base_task_name='ingest'
)
```
You can also pass the parameters from one step to the other (for example `Task.id`).
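For example, a step's parameter can be overridden with another step's output via `parameter_override` (a sketch; the parameter name `General/dataset_id` is illustrative):

```python
pipe.add_step(
    name='step 2 train',
    parents=['step 1 data', ],
    base_task_project='data',
    base_task_name='ingest',
    # pass the first step's task ID into the training step's parameters
    parameter_override={'General/dataset_id': '${step 1 data.id}'}
)
```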
In addition to pipelines made up of Task steps, ClearML also supports pipelines consisting of function steps. For more
information, see the [full pipeline documentation](../../pipelines/pipelines.md).

View File

@@ -0,0 +1,43 @@
---
title: Monitoring Project Progress
---
ClearML provides a comprehensive set of monitoring tools to help effectively track and manage machine learning projects.
These tools offer both high-level overviews and detailed insights into task execution, resource
utilization, and project performance.
## Offerings
### Project Dashboard
:::info Pro Plan Offering
The Project Dashboard app is available under the ClearML Pro plan.
:::
The [**Project Dashboard**](../webapp/applications/apps_dashboard.md) UI application provides a centralized
view of project progress, task statuses, resource usage, and key performance metrics. It offers:
* Comprehensive insights:
* Track task statuses and trends over time.
* Monitor GPU utilization and worker activity.
* Analyze performance metrics.
* Proactive alerts - By integrating with Slack, the Dashboard can notify teams of task failures
and completions.
For more information, see [Project Dashboard](../webapp/applications/apps_dashboard.md).
![Project Dashboard](../img/apps_dashboard.png#light-mode-only)
![Project Dashboard](../img/apps_dashboard_dark.png#dark-mode-only)
### Project Overview
A project's **OVERVIEW** tab in the UI presents a general picture of a project:
* **Metric Snapshot** - A graphical representation of selected metric values across project tasks, offering a quick assessment of progress.
* **Task Status Tracking** - When a single metric variant is selected for the snapshot, task status is color-coded (e.g.,
Completed, Aborted, Published, Failed) for better visibility.
Use the Metric Snapshot to track project progress and identify trends in task performance.
For more information, see [Project Overview](../webapp/webapp_project_overview.md).
![Project Overview](../img/webapp_project_overview.png#light-mode-only)
![Project Overview](../img/webapp_project_overview_dark.png#dark-mode-only)

View File

@@ -0,0 +1,84 @@
---
title: Remote Execution
---
:::note
This guide assumes that you've already set up [ClearML](../clearml_sdk/clearml_sdk_setup.md) and [ClearML Agent](../clearml_agent/clearml_agent_setup.md).
:::
ClearML Agent enables seamless remote execution by offloading computations from a local development environment to a more
powerful remote machine. This is useful for:
* Running an initial process (a task or function) locally before scaling up.
* Offloading resource-intensive tasks to dedicated compute nodes.
* Managing execution through ClearML's queue system.
This guide focuses on transitioning a locally executed process to a remote machine for scalable execution. To learn how
to reproduce a previously executed process on a remote machine, see [Reproducing Tasks](reproduce_tasks.md).
## Running a Task Remotely
A compelling workflow is:
1. Run code on a development machine for a few iterations, or just set up the environment.
1. Move the execution to a beefier remote machine for the actual training.
Use [`Task.execute_remotely()`](../references/sdk/task.md#execute_remotely) to implement this workflow. This method stops the current manual execution, and then
re-runs it on a remote machine.
1. Deploy a `clearml-agent` from your beefier remote machine and assign it to the `default` queue:
```commandline
clearml-agent daemon --queue default
```
1. Run the local code to send to the remote machine for execution:
```python
from clearml import Task
task = Task.init(project_name="myProject", task_name="myTask")
# training code
task.execute_remotely(
queue_name='default',
clone=False,
exit_process=True
)
```
Once `execute_remotely()` is called on the machine, it stops the local process and enqueues the current task into the `default`
queue. From there, an agent assigned to the queue can pull and launch it.
## Running a Function Remotely
You can execute a specific function remotely using [`Task.create_function_task()`](../references/sdk/task.md#create_function_task).
This method creates a ClearML Task from a Python function and runs it on a remote machine.
For example:
```python
from clearml import Task
task = Task.init(project_name="myProject", task_name="Remote function")
def run_me_remotely(some_argument):
print(some_argument)
a_func_task = task.create_function_task(
func=run_me_remotely,
func_name='func_id_run_me_remotely',
task_name='a func task',
# everything below will be passed directly to our function as arguments
some_argument=123
)
```
:::important Function Task Creation
Function tasks must be created from within a regular task, created by calling `Task.init`
:::
Arguments passed to the function will be automatically logged in the task's **CONFIGURATION** tab under the **HYPERPARAMETERS > Function section**.
Like any other arguments, they can be changed from the UI or programmatically.

View File

@@ -0,0 +1,82 @@
---
title: Reproducing Tasks
---
:::note
This tutorial assumes that you've already set up [ClearML](../clearml_sdk/clearml_sdk_setup.md) and [ClearML Agent](../clearml_agent/clearml_agent_setup.md).
:::
Tasks can be reproduced--or **Cloned**--for validation or as a baseline for further experimentation. When you initialize a task in your
code, ClearML logs everything needed to reproduce your task and its environment:
* Uncommitted changes
* Used packages and their versions
* Parameters
* and more
Cloning a task duplicates the task's configuration, but not its outputs.
ClearML offers two ways to clone your task:
* [Via the WebApp](#via-the-webapp)--no further code required
* [Via programmatic interface](#via-programmatic-interface) using the `clearml` Python package
Once you have cloned your task, you can modify its setup, and then execute it remotely on a machine of your choice using a ClearML Agent.
## Via the WebApp
**To clone a task in the ClearML WebApp:**
1. Click on any project card to open its [task table](../webapp/webapp_exp_table.md).
1. Right-click the task you want to reproduce.
1. Click **Clone** in the context menu, which will open a **CLONE TASK** window.
1. Click **CLONE** in the window.
The newly cloned task's details page will open up. The cloned task is in *draft* mode, which means
it can be modified. You can edit any of the Task's setup details, including:
* Git and/or code references
* Python packages to be installed
* Container image to be used
You can adjust the values of the task's hyperparameters and configuration files. See [Modifying Tasks](../webapp/webapp_exp_tuning.md#modifying-tasks) for more
information about editing tasks in the UI.
### Enqueue a Task
Once you have set up a task, it is now time to execute it.
**To execute a task through the ClearML WebApp:**
1. In the task's details page, click "Menu" <img src="/docs/latest/icons/ico-bars-menu.svg" alt="Menu" className="icon size-md space-sm" />
1. Click **ENQUEUE** to open the **ENQUEUE TASK** window
1. In the window, select `default` in the `Queue` menu
1. Click **ENQUEUE**
This action pushes the task into the `default` queue. The task's status becomes *Pending* until an agent
assigned to the queue fetches it, at which time the task's status becomes *Running*. The agent executes the
task, and the task can be [tracked and its results visualized](../webapp/webapp_exp_track_visual.md).
## Via Programmatic Interface
The cloning, modifying, and enqueuing actions described above can also be performed programmatically using `clearml`.
### Clone the Task
To duplicate the task, use [`Task.clone()`](../references/sdk/task.md#taskclone), and input either a
Task object or the Task's ID as the `source_task` argument.
```python
cloned_task = Task.clone(source_task='qw03485je3hap903ere54')
```
The cloned task is in *draft* mode, which means it can be modified. For modification options, such as setting new parameter
values, see [Task SDK](../clearml_sdk/task_sdk.md).
### Enqueue the Task
To enqueue the task, use [`Task.enqueue()`](../references/sdk/task.md#taskenqueue), and input the Task object
with the `task` argument, and the queue to push the task into with `queue_name`.
```python
Task.enqueue(task=cloned_task, queue_name='default')
```
This action pushes the task into the `default` queue. The task's status becomes *Pending* until an agent
assigned to the queue fetches it, at which time the task's status becomes *Running*. The agent executes the
task, and the task can be [tracked and its results visualized](../webapp/webapp_exp_track_visual.md).
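To follow the enqueued task from code, you can block until it finishes and then inspect the results (a sketch; `wait_for_status` polls until the task reaches a final state):

```python
# block until the agent finishes executing the task (polling every 60 seconds)
cloned_task.wait_for_status(check_interval_sec=60)

# refresh the local task object, then inspect the reported results
cloned_task.reload()
print(cloned_task.get_last_scalar_metrics())
```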

View File

@@ -0,0 +1,41 @@
---
title: Scheduling and Triggering Task Execution
---
In ClearML, tasks can be scheduled and triggered automatically, enabling seamless workflow automation. This section
provides an overview of the mechanisms available for managing task scheduling and event-based
triggering.
## Task Scheduling
Task scheduling allows users to define one-shot or periodic executions at specified times and intervals. This
is useful for:
* Running routine operations such as periodic model training, evaluation jobs, backups, and reports.
* Automating data ingestion and preprocessing workflows.
* Ensuring regular execution of monitoring and reporting tasks.
ClearML offers the following scheduling solutions:
* [**UI Application**](../webapp/applications/apps_task_scheduler.md) (available under the Enterprise Plan) - The **Task Scheduler** app
provides a simple no-code interface for managing task schedules.
* [**Python Interface**](../references/sdk/scheduler.md) - Use the `TaskScheduler` class to programmatically manage
task schedules.
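As a sketch, the `TaskScheduler` class can re-launch an existing task on a recurring schedule (the task ID and queue name are placeholders):

```python
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()

# clone-and-enqueue the template task into the 'default' queue every day at 08:30
scheduler.add_task(
    schedule_task_id='aabbcc',
    queue='default',
    hour=8,
    minute=30,
)

# start serving the schedule (this call blocks)
scheduler.start()
```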
## Task Execution Triggering
ClearML's trigger manager enables you to automate task execution based on event occurrence in the ClearML system, such as:
* Changes in task status (e.g. running, completed, etc.)
* Publication, archiving, or tagging of tasks, models, or datasets
* Task metrics crossing predefined thresholds
This is useful for:
* Triggering a training task when a dataset has been tagged as `latest` or any other tag
* Running an inference task when a model has been published
* Retraining a model when accuracy falls below a certain threshold
* And more
ClearML offers the following trigger management solutions:
* [**UI Application**](../webapp/applications/apps_trigger_manager.md) (available under the Enterprise Plan) - The **Trigger Manager** app
provides a simple no-code interface for managing task triggers.
* [**Python Interface**](../references/sdk/trigger.md) - Use the `TriggerScheduler` class to programmatically manage
task triggers.
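As a sketch, the `TriggerScheduler` class can launch a task when a model is published (the IDs, queue, and project names are placeholders):

```python
from clearml.automation import TriggerScheduler

trigger = TriggerScheduler()

# when a model in the 'examples' project is published,
# clone and enqueue the inference task
trigger.add_model_trigger(
    schedule_task_id='aabbcc',
    schedule_queue='default',
    trigger_project='examples',
    trigger_on_publish=True,
)

# start serving the triggers (this call blocks)
trigger.start()
```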

View File

@@ -0,0 +1,46 @@
---
title: Tracking Tasks
---
Every ClearML [task](../fundamentals/task.md) you create can be found in the **All Tasks** table and in its project's
task table.
The task table is a powerful tool for creating dashboards and views of your own projects, your team's projects, or the
entire development effort.
![Task table](../img/webapp_experiment_table.png#light-mode-only)
![Task table](../img/webapp_experiment_table_dark.png#dark-mode-only)
Customize the [task table](../webapp/webapp_exp_table.md) to fit your own needs by adding views of parameters, metrics, and tags.
Filter and sort based on various criteria, such as parameters and metrics, making it simple to create custom
views. This allows you to:
* Create a dashboard for a project, presenting the latest model accuracy scores, for immediate insights.
* Create a live leaderboard displaying the best-performing tasks, updated in real time
* Monitor a project's progress and share it across the organization.
## Creating Leaderboards
To create a leaderboard:
1. Select a project in the ClearML WebApp and go to its task table
1. Customize the column selection. Click "Settings" <img src="/docs/latest/icons/ico-settings.svg" alt="Setting Gear" className="icon size-md" />
to view and select columns to display.
1. Filter tasks by name using the search bar to find tasks containing any search term
1. Filter by other categories by clicking "Filter" <img src="/docs/latest/icons/ico-filter-off.svg" alt="Filter" className="icon size-md" />
on the relevant column. There are a few types of filters:
* Value set - Choose which values to include from a list of all values in the column
* Numerical ranges - Insert minimum and/or maximum value
* Date ranges - Insert starting and/or ending date and time
* Tags - Choose which tags to filter by from a list of all tags used in the column.
* Filter by multiple tag values using the **ANY** or **ALL** options, which correspond to the logical "AND" and "OR" respectively. These
options appear on the top of the tag list.
* Filter by the absence of a tag (logical "NOT") by clicking its checkbox twice. An `X` will appear in the tag's checkbox.
1. Enable auto-refresh for real-time monitoring
For more detailed instructions, see the [Tracking Leaderboards Tutorial](../guides/ui/building_leader_board.md).
## Sharing Leaderboards
Bookmark the URL of your customized leaderboard to save and share your view. The URL contains all parameters and values
for your specific leaderboard view.

View File

@@ -7,7 +7,7 @@ on a remote or local machine, from a remote repository and your local machine.
### Prerequisites
- [`clearml`](../../getting_started/ds/ds_first_steps.md) Python package installed and configured
- [`clearml`](../../clearml_sdk/clearml_sdk_setup) Python package installed and configured
- [`clearml-agent`](../../clearml_agent/clearml_agent_setup.md#installation) running on at least one machine (to execute the task), configured to listen to `default` queue
### Executing Code from a Remote Repository

View File

@@ -9,7 +9,7 @@ script.
## Prerequisites
* [`clearml-agent`](../../clearml_agent/clearml_agent_setup.md#installation) installed and configured
* [`clearml`](../../getting_started/ds/ds_first_steps.md#install-clearml) installed and configured
* [`clearml`](../../clearml_sdk/clearml_sdk_setup#install-clearml) installed and configured
* [clearml](https://github.com/clearml/clearml) repo cloned (`git clone https://github.com/clearml/clearml.git`)
## Creating the ClearML Task

View File

@@ -11,7 +11,7 @@ be used when running optimization tasks.
## Prerequisites
* [`clearml-agent`](../../clearml_agent/clearml_agent_setup.md#installation) installed and configured
* [`clearml`](../../getting_started/ds/ds_first_steps.md#install-clearml) installed and configured
* [`clearml`](../../clearml_sdk/clearml_sdk_setup#install-clearml) installed and configured
* [clearml](https://github.com/clearml/clearml) repo cloned (`git clone https://github.com/clearml/clearml.git`)
## Creating the ClearML Task

View File

@@ -3,10 +3,10 @@ title: Keras Tuner
---
:::tip
If you are not already using ClearML, see [Getting Started](../../../getting_started/ds/ds_first_steps.md) for setup
instructions.
If you are not already using ClearML, see [ClearML Setup instructions](../clearml_sdk/clearml_sdk_setup).
:::
Integrate ClearML into code that uses [Keras Tuner](https://www.tensorflow.org/tutorials/keras/keras_tuner). By
specifying `ClearMLTunerLogger` (see [kerastuner.py](https://github.com/clearml/clearml/blob/master/clearml/external/kerastuner.py))
as the Keras Tuner logger, ClearML automatically logs scalars and hyperparameter optimization.

View File

@@ -1,6 +1,6 @@
---
id: guidemain
title: Examples
title: ClearML Tutorials
slug: /guides
---

View File

@@ -1,6 +1,10 @@
---
title: Tasks
title: Dataviews
---
:::important ENTERPRISE FEATURE
Dataviews are available under the ClearML Enterprise plan.
:::
Hyper-Datasets extend the ClearML [**Task**](../fundamentals/task.md) with [Dataviews](dataviews.md).

View File

@@ -2,6 +2,10 @@
title: Annotation Tasks
---
:::important ENTERPRISE FEATURE
Annotation tasks are available under the ClearML Enterprise plan.
:::
Use the Annotations page to access and manage annotation Tasks.
Use annotation tasks to efficiently organize the annotation of frames in Dataset versions and manage the work of annotators

View File

@@ -2,6 +2,10 @@
title: Hyper-Datasets Page
---
:::important ENTERPRISE FEATURE
Hyper-Datasets are available under the ClearML Enterprise plan.
:::
Use the Hyper-Datasets Page to navigate between and manage hyper-datasets.
You can view the Hyper-Datasets page in Project view <img src="/docs/latest/icons/ico-project-view.svg" alt="Project view" className="icon size-md" />

View File

@@ -2,6 +2,10 @@
title: Working with Frames
---
:::important ENTERPRISE FEATURE
Hyper-Datasets are available under the ClearML Enterprise plan.
:::
View and edit SingleFrames in the Dataset page. After selecting a Hyper-Dataset version, the **Version Browser** shows a sample
of frames and enables viewing SingleFrames and FramesGroups, and editing SingleFrames, in the [frame viewer](#frame-viewer).
Before opening the frame viewer, you can filter the frames by applying [simple](webapp_datasets_versioning.md#simple-frame-filtering) or [advanced](webapp_datasets_versioning.md#advanced-frame-filtering)

View File

@@ -2,6 +2,10 @@
title: Dataset Versions
---
:::important ENTERPRISE FEATURE
Hyper-Datasets are available under the ClearML Enterprise plan.
:::
Use the Dataset versioning WebApp (UI) features for viewing, creating, modifying, and
deleting [Dataset versions](../dataset.md#dataset-versioning).

View File

@@ -2,6 +2,10 @@
title: The Dataview Table
---
:::important ENTERPRISE FEATURE
Dataviews are available under the ClearML Enterprise plan.
:::
The **Dataview table** is a [customizable](#customizing-the-dataview-table) list of Dataviews associated with a project.
Use it to view and create Dataviews, and access their info panels.

View File

@@ -2,6 +2,10 @@
title: Comparing Dataviews
---
:::important ENTERPRISE FEATURE
Dataviews are available under the ClearML Enterprise plan.
:::
In addition to [ClearML's comparison features](../../webapp/webapp_exp_comparing.md), the ClearML Enterprise WebApp
supports comparing input data selection criteria of task [Dataviews](../dataviews.md), enabling to easily locate, visualize, and analyze differences.

View File

@@ -2,6 +2,10 @@
title: Modifying Dataviews
---
:::important ENTERPRISE FEATURE
Dataviews are available under the ClearML Enterprise plan.
:::
A task that has been executed can be [cloned](../../webapp/webapp_exp_reproducing.md), then the cloned task's
execution details can be modified, and the modified task can be executed.

View File

@@ -2,6 +2,10 @@
title: Task Dataviews
---
:::important ENTERPRISE FEATURE
Dataviews are available under the ClearML Enterprise plan.
:::
While a task is running, and any time after it finishes, results are tracked and can be visualized in the ClearML
Enterprise WebApp (UI).

Binary file not shown.

After

Width:  |  Height:  |  Size: 4.2 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 4.6 KiB

BIN
docs/img/app_cond_str.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 12 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 14 KiB

BIN
docs/img/app_group.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 12 KiB

BIN
docs/img/app_group_dark.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 13 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 35 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 34 KiB

BIN
docs/img/app_log.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 43 KiB

BIN
docs/img/app_log_dark.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 45 KiB

BIN
docs/img/app_plot.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 116 KiB

BIN
docs/img/app_plot_dark.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 134 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 7.7 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 7.6 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 522 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 441 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 388 KiB

After

Width:  |  Height:  |  Size: 372 KiB

Some files were not shown because too many files have changed in this diff.