From 1d1ffd17fbc26801d400a2cb6e7a58ef78ab0e8b Mon Sep 17 00:00:00 2001
From: allegroai <>
Date: Sun, 31 Jul 2022 19:36:48 +0300
Subject: [PATCH] Fix README
---
README.md | 196 +++++++++++++++++++++++++++++++++---------------------
1 file changed, 121 insertions(+), 75 deletions(-)
diff --git a/README.md b/README.md
index 5177aab..22b8cdb 100644
--- a/README.md
+++ b/README.md
@@ -9,14 +9,14 @@ ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/clearml-agent.svg)](https://img.shields.io/pypi/pyversions/clearml-agent.svg)
[![PyPI version](https://img.shields.io/pypi/v/clearml-agent.svg)](https://img.shields.io/pypi/v/clearml-agent.svg)
[clearml-agent on PyPI](https://pypi.org/project/clearml-agent/)
-[](https://artifacthub.io/packages/search?repo=allegroai)
+[Artifact Hub](https://artifacthub.io/packages/search?repo=allegroai)
---
### ClearML-Agent
-#### *Formerly known as Trains Agent*
+#### *Formerly known as Trains Agent*
* Run jobs (experiments) on any local or cloud-based resource
* Implement optimized resource utilization policies
@@ -24,23 +24,31 @@ ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
* Launch-and-Forget service containers
* [Cloud autoscaling](https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler)
* [Customizable cleanup](https://clear.ml/docs/latest/docs/guides/services/cleanup_service)
-* Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)
+* Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)
It is a zero configuration fire-and-forget execution agent, providing a full ML/DL cluster solution.
**Full Automation in 5 steps**
-1. ClearML Server [self-hosted](https://github.com/allegroai/clearml-server) or [free tier hosting](https://app.clear.ml)
-2. `pip install clearml-agent` ([install](#installing-the-clearml-agent) the ClearML Agent on any GPU machine: on-premises / cloud / ...)
-3. Create a [job](https://github.com/allegroai/clearml/docs/clearml-task.md) or Add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines
-4. Change the [parameters](#using-the-clearml-agent) in the UI & schedule for [execution](#using-the-clearml-agent) (or automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
+
+1. ClearML Server [self-hosted](https://github.com/allegroai/clearml-server)
+ or [free tier hosting](https://app.clear.ml)
+2. `pip install clearml-agent` ([install](#installing-the-clearml-agent) the ClearML Agent on any GPU machine:
+ on-premises / cloud / ...)
+3. Create a [job](https://github.com/allegroai/clearml/blob/master/docs/clearml-task.md) or
+   add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines (see the sketch after this list)
+4. Change the [parameters](#using-the-clearml-agent) in the UI & schedule for [execution](#using-the-clearml-agent) (or
+ automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:
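+
+The "2 lines" in step 3 are an import and a single `Task.init()` call. A minimal sketch (the project and task names
+are placeholders):
+
+```bash
+# Wrap an existing training script with ClearML's two-line integration and run it
+# (assumes `pip install clearml` and a configured ~/clearml.conf).
+cat > train_with_clearml.py <<'EOF'
+from clearml import Task
+task = Task.init(project_name="examples", task_name="my training run")
+# ... the rest of your training code is unchanged ...
+EOF
+python train_with_clearml.py
+```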
"All the Deep/Machine-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"
-**Try ClearML now** [Self Hosted](https://github.com/allegroai/clearml-server) or [Free tier Hosting](https://app.clear.ml)
+**Try ClearML now** [Self Hosted](https://github.com/allegroai/clearml-server)
+or [Free tier Hosting](https://app.clear.ml)
### Simple, Flexible Experiment Orchestration
+
**The ClearML Agent was built to address the DL/ML R&D DevOps needs:**
* Easily add & remove machines from the cluster
@@ -56,20 +64,23 @@ It is a zero configuration fire-and-forget execution agent, providing a full ML/
*epsilon - Because we are :triangular_ruler: and nothing is really zero work
-
### Kubernetes Integration (Optional)
-We think Kubernetes is awesome, but it should be a choice.
-We designed `clearml-agent` so you can run bare-metal or inside a pod with any mix that fits your environment.
+
+We think Kubernetes is awesome, but it should be a choice. We designed `clearml-agent` so you can run bare-metal or
+inside a pod with any mix that fits your environment.
Find Dockerfiles in the [docker](./docker) dir and a Helm chart in https://github.com/allegroai/clearml-helm-charts
-#### Benefits of integrating existing K8s with ClearML-Agent
+
+#### Benefits of integrating existing K8s with ClearML-Agent
+
- ClearML-Agent adds the missing scheduling capabilities to K8s
- Allowing for more flexible automation from code
  - A programmatic interface for an easier learning curve (and debugging)
- Seamless integration with ML/DL experiment manager
-- Web UI for customization, scheduling & prioritization of jobs
+- Web UI for customization, scheduling & prioritization of jobs
+
+**Two K8s integration flavours**
-**Two K8s integration flavours**
- Spin ClearML-Agent as a long-lasting service pod
- use [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
- map docker socket into the pod (soon replaced by [podman](https://github.com/containers/podman))
@@ -77,57 +88,66 @@ Find Dockerfiles in the [docker](./docker) dir and a helm Chart in https://githu
- benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
- downside: Sibling containers
- Kubernetes Glue, map ClearML jobs directly to K8s jobs
- - Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on a K8s cpu node
- - The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on provided yaml template)
- - Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the experiment's process
+  - Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on
+    a K8s CPU node (see the launch sketch after this list)
+  - The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on the
+    provided YAML template)
+  - Inside the pod itself, the clearml-agent will install the job (experiment) environment and spin up and monitor the
+    experiment's process
  - benefits: full Kubernetes visibility into all running jobs in the system
- - downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)
+ - downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)
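+
+A minimal launch sketch for the glue (the queue name and flags here are assumptions; check
+`python examples/k8s_glue_example.py --help` for the exact interface):
+
+```bash
+# From a clone of this repo, map a ClearML queue to K8s jobs using a pod template.
+git clone https://github.com/allegroai/clearml-agent.git && cd clearml-agent
+python examples/k8s_glue_example.py --queue k8s_scheduler --template-yaml my_pod_template.yaml
+```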
### Using the ClearML Agent
+
**Full scale HPC with a click of a button**
-The ClearML Agent is a job scheduler that listens on job queue(s), pulls jobs, sets the job environments, executes the job and monitors its progress.
+The ClearML Agent is a job scheduler that listens on job queue(s), pulls jobs, sets up the job environment, executes
+the job and monitors its progress.
Any 'Draft' experiment can be scheduled for execution by a ClearML agent.
A previously run experiment can be put into 'Draft' state by either of two methods:
-* Using the **'Reset'** action from the experiment right-click context menu in the
- ClearML UI - This will clear any results and artifacts the previous run had created.
-* Using the **'Clone'** action from the experiment right-click context menu in the
- ClearML UI - This will create a new 'Draft' experiment with the same configuration as the original experiment.
-An experiment is scheduled for execution using the **'Enqueue'** action from the experiment
- right-click context menu in the ClearML UI and selecting the execution queue.
+* Using the **'Reset'** action from the experiment right-click context menu in the ClearML UI - This will clear any
+ results and artifacts the previous run had created.
+* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - This will create a
+  new 'Draft' experiment with the same configuration as the original experiment.
+
+An experiment is scheduled for execution using the **'Enqueue'** action from the experiment right-click context menu in
+the ClearML UI and selecting the execution queue.
See [creating an experiment and enqueuing it for execution](#from-scratch).
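+
+Enqueuing can also be done programmatically via the ClearML SDK; a short sketch (the task ID is a placeholder):
+
+```bash
+python - <<'EOF'
+from clearml import Task
+# Fetch an existing 'Draft' experiment by ID and push it into the 'default' queue.
+draft = Task.get_task(task_id="aabbccddee")
+Task.enqueue(draft, queue_name="default")
+EOF
+```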
Once an experiment is enqueued, it will be picked up and executed by a ClearML agent monitoring this queue.
The ClearML UI Workers & Queues page provides ongoing execution information:
- - Workers Tab: Monitor you cluster
+
+- Workers Tab: Monitor your cluster
- Review available resources
  - Monitor machine statistics (CPU / GPU / Disk / Network)
- - Queues Tab:
+- Queues Tab:
- Control the scheduling order of jobs
- Cancel or abort job execution
- Move jobs between execution queues
#### What The ClearML Agent Actually Does
+
The ClearML Agent executes experiments using the following process (a single-task sketch follows the list):
- - Create a new virtual environment (or launch the selected docker image)
- - Clone the code into the virtual-environment (or inside the docker)
- - Install python packages based on the package requirements listed for the experiment
- - Special note for PyTorch: The ClearML Agent will automatically select the
- torch packages based on the CUDA_VERSION environment variable of the machine
- - Execute the code, while monitoring the process
- - Log all stdout/stderr in the ClearML UI, including the cloning and installation process, for easy debugging
- - Monitor the execution and allow you to manually abort the job using the ClearML UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
+
+- Create a new virtual environment (or launch the selected docker image)
+- Clone the code into the virtual-environment (or inside the docker)
+- Install Python packages based on the package requirements listed for the experiment
+ - Special note for PyTorch: The ClearML Agent will automatically select the torch packages based on the CUDA_VERSION
+ environment variable of the machine
+- Execute the code, while monitoring the process
+- Log all stdout/stderr in the ClearML UI, including the cloning and installation process, for easy debugging
+- Monitor the execution and allow you to manually abort the job using the ClearML UI (or, in the unfortunate case of a
+ code crash, catch the error and signal the experiment has failed)
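+
+The same install-and-run flow can be exercised for a single job, without a daemon, via the agent's `execute`
+sub-command (the task ID is a placeholder):
+
+```bash
+# Build the environment for one specific task, run it, and exit when it completes.
+clearml-agent execute --id aabbccddee
+```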
#### System Design & Flow
-
#### Installing the ClearML Agent
```bash
@@ -137,6 +157,7 @@ pip install clearml-agent
#### ClearML Agent Usage Examples
The full interface and capabilities are available with
+
```bash
clearml-agent --help
clearml-agent daemon --help
@@ -148,7 +169,8 @@ clearml-agent daemon --help
clearml-agent init
```
-Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default ClearML Agent cache folder is `~/.clearml`
+Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default
+ClearML Agent cache folder is `~/.clearml`.
See full details in your configuration file at `~/clearml.conf`
@@ -158,29 +180,36 @@ They are designed to share the same configuration file, see example [here](docs/
#### Running the ClearML Agent
For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen
+
```bash
clearml-agent daemon --queue default --foreground
```
For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe)
Notice: with the `--detached` flag, the *clearml-agent* will be running in the background
+
```bash
clearml-agent daemon --detached --queue default
```
-GPU allocation is controlled via the standard OS environment `NVIDIA_VISIBLE_DEVICES` or `--gpus` flag (or disabled with `--cpu-only`).
+GPU allocation is controlled via the standard `NVIDIA_VISIBLE_DEVICES` OS environment variable or the `--gpus` flag (or
+disabled with `--cpu-only`).
-If no flag is set, and `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPU's will be allocated for the `clearml-agent`
-If `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES` is an empty string (""), no gpu will be allocated for the `clearml-agent`
+If no flag is set and the `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPUs will be allocated for
+the `clearml-agent`.
+If the `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES` is an empty string (""), no GPU will be allocated for
+the `clearml-agent`.
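+
+For instance, a sketch of pinning an agent to a single GPU through the environment variable instead of the flag:
+
+```bash
+# Expose only GPU 0 to this agent; equivalent in effect to --gpus 0.
+NVIDIA_VISIBLE_DEVICES=0 clearml-agent daemon --detached --queue default
+```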
Example: spin two agents, one per GPU on the same machine:
Notice: with the `--detached` flag, the *clearml-agent* will be running in the background
+
```bash
clearml-agent daemon --detached --gpus 0 --queue default
clearml-agent daemon --detached --gpus 1 --queue default
```
Example: spin two agents, pulling from a dedicated `dual_gpu` queue, two GPUs per agent
+
```bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu
@@ -189,23 +218,29 @@ clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu
##### Starting the ClearML Agent in docker mode
For debug and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to screen
+
```bash
clearml-agent daemon --queue default --docker --foreground
```
For actual service mode, all the stdout will be stored automatically into a file (no need to pipe)
Notice: with the `--detached` flag, the *clearml-agent* will be running in the background
+
```bash
clearml-agent daemon --detached --queue default --docker
```
-Example: spin two agents, one per gpu on the same machine, with default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 docker:
+Example: spin two agents, one per GPU on the same machine, with the default
+`nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04` docker image:
+
```bash
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
```
-Example: spin two agents, pulling from dedicated `dual_gpu` queue, two gpu's per agent, with default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 docker:
+Example: spin two agents, pulling from a dedicated `dual_gpu` queue, two GPUs per agent, with the default
+`nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04` docker image:
+
```bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
@@ -216,55 +251,61 @@ clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda
Priority queues are also supported. Example use case:
High priority queue: `important_jobs`; low priority queue: `default`
+
```bash
clearml-agent daemon --queue important_jobs default
```
-The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, only then it will fetch a job from the `default` queue.
-Adding queues, managing job order within a queue and moving jobs between queues, is available using the Web UI, see example on our [free server](https://app.clear.ml/workers-and-queues/queues)
+The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, and only then fetch a job from
+the `default` queue.
+
+Adding queues, managing job order within a queue, and moving jobs between queues are all available using the Web UI;
+see the example on our [free server](https://app.clear.ml/workers-and-queues/queues)
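+
+Queues can also be created on the fly when an agent spins up, using the `--create-queue` flag (a sketch, assuming the
+flag applies to every listed queue):
+
+```bash
+# Create 'important_jobs' and 'default' if missing, then listen in priority order.
+clearml-agent daemon --detached --queue important_jobs default --create-queue
+```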
##### Stopping the ClearML Agent
-To stop a **ClearML Agent** running in the background, run the same command line used to start the agent with `--stop` appended.
-For example, to stop the first of the above shown same machine, single gpu agents:
+To stop a **ClearML Agent** running in the background, run the same command line used to start the agent with `--stop`
+appended. For example, to stop the first of the single-GPU agents shown above:
+
```bash
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 --stop
```
### How do I create an experiment on the ClearML Server?
+
* Integrate [ClearML](https://github.com/allegroai/clearml) with your code
* Execute the code on your machine (Manually / PyCharm / Jupyter Notebook)
* As your code is running, **ClearML** creates an experiment logging all the necessary execution information:
- - Git repository link and commit ID (or an entire jupyter notebook)
- - Git diff (we’re not saying you never commit and push, but still...)
- - Python packages used by your code (including specific versions used)
- - Hyper-Parameters
- - Input Artifacts
+  - Git repository link and commit ID (or an entire Jupyter notebook)
+ - Git diff (we’re not saying you never commit and push, but still...)
+ - Python packages used by your code (including specific versions used)
+ - Hyper-Parameters
+ - Input Artifacts
You now have a 'template' of your experiment with everything required for automated execution
-* In the ClearML UI, Right click on the experiment and select 'clone'. A copy of your experiment will be created.
+* In the ClearML UI, right-click the experiment and select 'clone'. A copy of your experiment will be created.
* You now have a new draft experiment cloned from your original experiment; feel free to edit it:
- - Change the Hyper-Parameters
- - Switch to the latest code base of the repository
- - Update package versions
- - Select a specific docker image to run in (see docker execution mode section)
- - Or simply change nothing to run the same experiment again...
+ - Change the Hyper-Parameters
+ - Switch to the latest code base of the repository
+ - Update package versions
+ - Select a specific docker image to run in (see docker execution mode section)
+ - Or simply change nothing to run the same experiment again...
* Schedule the newly created experiment for execution: Right-click the experiment and select 'enqueue' (see the
  automation sketch after this list)
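+
+The clone / edit / enqueue cycle can also be automated with the ClearML SDK. A minimal sketch (the task ID, names,
+and parameter key are placeholders):
+
+```bash
+python - <<'EOF'
+from clearml import Task
+# Clone the 'template' experiment, tweak one hyper-parameter, and enqueue the copy.
+template = Task.get_task(task_id="aabbccddee")        # placeholder task ID
+cloned = Task.clone(source_task=template, name="cloned experiment")
+cloned.set_parameter("General/learning_rate", 0.001)  # assumed parameter key
+Task.enqueue(cloned, queue_name="default")
+EOF
+```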
### ClearML-Agent Services Mode
-ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs
-that previously had to be executed on local / dedicated machines. It allows a single agent to
-launch multiple dockers (Tasks) for different use cases. To name a few use cases, auto-scaler service (spinning instances
-when the need arises and the budget allows), Controllers (Implementing pipelines and more sophisticated DevOps logic),
-Optimizer (such as Hyper-parameter Optimization or sweeping), and Application (such as interactive Bokeh apps for
-increased data transparency)
+ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs that
+previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks)
+for different use cases. To name a few: an auto-scaler service (spinning up instances when the need arises and the
+budget allows), controllers (implementing pipelines and more sophisticated DevOps logic), optimizers (such as
+hyper-parameter optimization or sweeping), and applications (such as interactive Bokeh apps for increased data
+transparency).
-ClearML-Agent Services mode will spin **any** task enqueued into the specified queue.
-Every task launched by ClearML-Agent Services will be registered as a new node in the system,
-providing tracking and transparency capabilities.
-Currently clearml-agent in services-mode supports cpu only configuration. ClearML-agent services mode can be launched alongside GPU agents.
+ClearML-Agent Services mode will spin **any** task enqueued into the specified queue. Every task launched by
+ClearML-Agent Services will be registered as a new node in the system, providing tracking and transparency capabilities.
+Currently, clearml-agent in services mode supports CPU-only configuration. ClearML-Agent services mode can be launched
+alongside GPU agents.
```bash
clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
@@ -272,22 +313,27 @@ clearml-agent daemon --services-mode --detached --queue services --create-queue
**Note**: It is the user's responsibility to make sure the proper tasks are pushed into the specified queue.
-
### AutoML and Orchestration Pipelines
-The ClearML Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the ClearML package.
-Sample AutoML & Orchestration examples can be found in the ClearML [example/automation](https://github.com/allegroai/clearml/tree/master/examples/automation) folder.
+The ClearML Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the
+ClearML package.
+
+Sample AutoML & Orchestration examples can be found in the
+ClearML [examples/automation](https://github.com/allegroai/clearml/tree/master/examples/automation) folder.
AutoML examples
- - [Toy Keras training experiment](https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
+
+- [Toy Keras training experiment](https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
- In order to create an experiment-template in the system, this code must be executed once manually
- - [Random Search over the above Keras experiment-template](https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py)
- - This example will create multiple copies of the Keras experiment-template, with different hyper-parameter combinations
+- [Random Search over the above Keras experiment-template](https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py)
+ - This example will create multiple copies of the Keras experiment-template, with different hyper-parameter
+ combinations
Experiment Pipeline examples
- - [First step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/task_piping_example.py)
+
+- [First step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/task_piping_example.py)
- This example will "process data", and once done, will launch a copy of the 'second step' experiment-template
- - [Second step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/toy_base_task.py)
+- [Second step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/toy_base_task.py)
- In order to create an experiment-template in the system, this code must be executed once manually
### License