<div align="center">

<img src="https://github.com/allegroai/clearml-agent/blob/master/docs/clearml_agent_logo.png?raw=true" width="250px">

**ClearML Agent - ML-Ops made easy
ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**

[![GitHub license](https://img.shields.io/github/license/allegroai/clearml-agent.svg)](https://img.shields.io/github/license/allegroai/clearml-agent.svg)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/clearml-agent.svg)](https://img.shields.io/pypi/pyversions/clearml-agent.svg)
[![PyPI version shields.io](https://img.shields.io/pypi/v/clearml-agent.svg)](https://img.shields.io/pypi/v/clearml-agent.svg)
[![PyPI Downloads](https://pepy.tech/badge/clearml-agent/month)](https://pypi.org/project/clearml-agent/)
[![Artifact Hub](https://img.shields.io/endpoint?url=https://artifacthub.io/badge/repository/allegroai)](https://artifacthub.io/packages/search?repo=allegroai)

</div>

---

### ClearML-Agent

#### *Formerly known as Trains Agent*

* Run jobs (experiments) on any local or cloud-based resource
* Implement optimized resource utilization policies
* Deploy execution environments with zero effort, either in a virtualenv or fully containerized with Docker
* Launch-and-Forget service containers
* [Cloud autoscaling](https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler)
* [Customizable cleanup](https://clear.ml/docs/latest/docs/guides/services/cleanup_service)
* Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)

It is a zero-configuration, fire-and-forget execution agent, providing a full ML/DL cluster solution.

**Full Automation in 5 steps**

1. ClearML Server [self-hosted](https://github.com/allegroai/clearml-server)
   or [free tier hosting](https://app.clear.ml)
2. `pip install clearml-agent` ([install](#installing-the-clearml-agent) the ClearML Agent on any CPU/GPU machine:
   on-premises / cloud / ...)
3. Create a [job](https://github.com/allegroai/clearml/blob/master/docs/clearml-task.md) (see the sketch after this list) or
   add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines
4. Change the [parameters](#using-the-clearml-agent) in the UI & schedule for [execution](#using-the-clearml-agent) (or
   automate with [pipelines](#automl-pipes))
5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:
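
For step 3, a minimal sketch of creating and enqueuing a job from an existing repository with the `clearml-task` CLI; the project, queue, repository and script values below are placeholders, so check the [clearml-task docs](https://github.com/allegroai/clearml/blob/master/docs/clearml-task.md) for the full set of flags:

```bash
# Create a ClearML job from an existing repository and enqueue it for any available agent.
# Project name, repo, script path and queue name are placeholders - replace with your own.
clearml-task \
  --project examples \
  --name remote_training \
  --repo https://github.com/allegroai/clearml.git \
  --script examples/frameworks/pytorch/pytorch_mnist.py \
  --queue default
```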

"All the Deep/Machine-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"

**Try ClearML now** [Self Hosted](https://github.com/allegroai/clearml-server)
or [Free tier Hosting](https://app.clear.ml)

<a href="https://app.clear.ml"><img src="https://github.com/allegroai/clearml-agent/blob/master/docs/screenshots.gif?raw=true" width="100%"></a>

### Simple, Flexible Experiment Orchestration

**The ClearML Agent was built to address the DL/ML R&D DevOps needs:**

* Easily add & remove machines from the cluster
* Reuse machines without the need for any dedicated containers or images
* **Combine GPU resources across any cloud and on-prem**
* **No need for yaml / json / template configuration of any kind**
* **User friendly UI**
* Manageable resource allocation that can be used by researchers and engineers
* Flexible and controllable scheduler with priority support
* Automatic instance spin-up in the cloud

**Using the ClearML Agent, you can now set up a dynamic cluster with \*epsilon DevOps**

\*epsilon - Because we are :triangular_ruler: and nothing is really zero work

### Kubernetes Integration (Optional)

We think Kubernetes is awesome, but it should be a choice. We designed `clearml-agent` so you can run bare-metal or
inside a pod with any mix that fits your environment.

Dockerfiles can be found in the [docker](./docker) dir and a Helm chart in https://github.com/allegroai/clearml-helm-charts

#### Benefits of integrating an existing K8s cluster with ClearML-Agent

- ClearML-Agent adds the missing scheduling capabilities to K8s
- Allows for more flexible automation from code
- A programmatic interface with an easier learning curve (and debugging)
- Seamless integration with the ML/DL experiment manager
- Web UI for customization, scheduling & prioritization of jobs

**Two K8s integration flavours**

- Spin ClearML-Agent up as a long-lasting service pod
    - Use the [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
    - Map the docker socket into the pod (to be replaced by [podman](https://github.com/containers/podman))
    - Allow the clearml-agent to manage sibling dockers
    - Benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods, etc.
    - Downside: sibling containers run outside of Kubernetes' control
- Kubernetes Glue: map ClearML jobs directly to K8s jobs
    - Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on
      a K8s CPU node (see the launch sketch after this list)
    - The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on a provided
      YAML template)
    - Inside the pod itself, the clearml-agent will install the job's (experiment's) environment and spin up and monitor
      the experiment's process
    - Benefits: Kubernetes has a full view of all running jobs in the system
    - Downside: no real scheduling (only the K8s scheduler), no docker image verification (post-mortem only)
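
A minimal sketch of launching the Kubernetes Glue from a checkout of this repository; the queue name below is a placeholder, and the exact arguments should be verified with `python examples/k8s_glue_example.py --help`:

```bash
# Run the ClearML k8s glue on a CPU node inside the cluster.
# "k8s_scheduler" is a placeholder queue name - adjust to your setup.
git clone https://github.com/allegroai/clearml-agent.git
cd clearml-agent
pip install clearml-agent
python examples/k8s_glue_example.py --queue k8s_scheduler
```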

### Using the ClearML Agent

**Full scale HPC with a click of a button**

The ClearML Agent is a job scheduler that listens on job queue(s), pulls jobs, sets up the job environment, executes the
job, and monitors its progress.

Any 'Draft' experiment can be scheduled for execution by a ClearML agent.

A previously run experiment can be put into 'Draft' state by either of two methods:

* Using the **'Reset'** action from the experiment right-click context menu in the ClearML UI - this will clear any
  results and artifacts the previous run created.
* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - this will create a new
  'Draft' experiment with the same configuration as the original experiment.

An experiment is scheduled for execution using the **'Enqueue'** action from the experiment right-click context menu in
the ClearML UI and selecting the execution queue.

See [creating an experiment and enqueuing it for execution](#from-scratch).

Once an experiment is enqueued, it will be picked up and executed by a ClearML agent monitoring this queue.

The ClearML UI Workers & Queues page provides ongoing execution information:

- Workers Tab: Monitor your cluster
    - Review available resources
    - Monitor machine statistics (CPU / GPU / Disk / Network)
- Queues Tab:
    - Control the scheduling order of jobs
    - Cancel or abort job execution
    - Move jobs between execution queues

#### What The ClearML Agent Actually Does

The ClearML Agent executes experiments using the following process:

- Create a new virtual environment (or launch the selected docker image)
- Clone the code into the virtual environment (or inside the docker)
- Install python packages based on the package requirements listed for the experiment
    - Special note for PyTorch: the ClearML Agent will automatically select the torch packages based on the `CUDA_VERSION`
      environment variable of the machine
- Execute the code, while monitoring the process
- Log all stdout/stderr in the ClearML UI, including the cloning and installation process, for easy debugging
- Monitor the execution and allow you to manually abort the job using the ClearML UI (or, in the unfortunate case of a
  code crash, catch the error and signal that the experiment has failed)
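
To watch this process end to end on a single experiment without starting a daemon, the agent can also build and run one specific task in the foreground. A minimal sketch; the task ID is a placeholder copied from a 'Draft' experiment in the UI:

```bash
# One-shot execution: set up the environment for a single task and run it in the foreground.
# <task-id> is a placeholder - use the ID of a 'Draft' experiment from the ClearML UI.
clearml-agent execute --id <task-id>
```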

#### System Design & Flow

<img src="https://github.com/allegroai/clearml-agent/blob/master/docs/clearml_architecture.png" width="100%" alt="clearml-architecture">

#### Installing the ClearML Agent

```bash
pip install clearml-agent
```

#### ClearML Agent Usage Examples

The full interface and capabilities are available with

```bash
clearml-agent --help
clearml-agent daemon --help
```

#### Configuring the ClearML Agent

```bash
clearml-agent init
```

Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default
ClearML Agent cache folder is `~/.clearml`.

See full details in your configuration file at `~/clearml.conf`.

Note: The **ClearML Agent** extends the **ClearML** configuration file `~/clearml.conf`.
They are designed to share the same configuration file; see an example [here](docs/clearml.conf).
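
As an alternative to the interactive `clearml-agent init`, the server address and credentials can also be supplied through environment variables before starting the agent. A minimal sketch, assuming the hosted server at clear.ml; the access/secret key values are placeholders created in the ClearML UI:

```bash
# Point the agent at a ClearML server without editing ~/clearml.conf.
# The key values below are placeholders - generate real credentials in the ClearML UI.
export CLEARML_API_HOST="https://api.clear.ml"
export CLEARML_WEB_HOST="https://app.clear.ml"
export CLEARML_FILES_HOST="https://files.clear.ml"
export CLEARML_API_ACCESS_KEY="<access_key>"
export CLEARML_API_SECRET_KEY="<secret_key>"
clearml-agent daemon --queue default
```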

#### Running the ClearML Agent

For debugging and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to the
screen:

```bash
clearml-agent daemon --queue default --foreground
```

For actual service mode, all stdout will be stored automatically in a temporary file (no need to pipe).
Note: with the `--detached` flag, the *clearml-agent* will run in the background:

```bash
clearml-agent daemon --detached --queue default
```

GPU allocation is controlled via the standard OS environment variable `NVIDIA_VISIBLE_DEVICES`, or the `--gpus` flag (or
disabled with `--cpu-only`).

If no flag is set and the `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPUs will be allocated to
the `clearml-agent`. <br>
If the `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES="none"`, no GPU will be allocated to
the `clearml-agent`.
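
For example, a minimal sketch of limiting the agent to specific GPUs with the environment variable instead of the `--gpus` flag (queue names are placeholders):

```bash
# Allocate only GPUs 0 and 1 to this agent (equivalent to --gpus 0,1)
NVIDIA_VISIBLE_DEVICES=0,1 clearml-agent daemon --detached --queue default

# Allocate no GPU at all (equivalent to --cpu-only)
NVIDIA_VISIBLE_DEVICES=none clearml-agent daemon --detached --queue cpu_only
```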

Example: spin two agents, one per GPU, on the same machine.
Note: with the `--detached` flag, the *clearml-agent* will run in the background:

```bash
clearml-agent daemon --detached --gpus 0 --queue default
clearml-agent daemon --detached --gpus 1 --queue default
```

Example: spin two agents, pulling from a dedicated `dual_gpu` queue, two GPUs per agent:

```bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu
```

##### Starting the ClearML Agent in docker mode

For debugging and experimentation, start the ClearML agent in `foreground` mode, where all the output is printed to the
screen:

```bash
clearml-agent daemon --queue default --docker --foreground
```

For actual service mode, all stdout will be stored automatically in a file (no need to pipe).
Note: with the `--detached` flag, the *clearml-agent* will run in the background:

```bash
clearml-agent daemon --detached --queue default --docker
```

Example: spin two agents, one per GPU, on the same machine, with the default
`nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04` docker image:

```bash
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
```

Example: spin two agents, pulling from a dedicated `dual_gpu` queue, two GPUs per agent, with the default
`nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04` docker image:

```bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
```

##### Starting the ClearML Agent - Priority Queues

Priority queues are also supported. Example use case:

High priority queue: `important_jobs`; low priority queue: `default`

```bash
clearml-agent daemon --queue important_jobs default
```

The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, and only then will it fetch jobs from
the `default` queue.

Adding queues, managing job order within a queue, and moving jobs between queues are all available using the Web UI; see
an example on our [free server](https://app.clear.ml/workers-and-queues/queues).

##### Stopping the ClearML Agent

To stop a **ClearML Agent** running in the background, run the same command line used to start the agent with `--stop`
appended. For example, to stop the first of the single-GPU, same-machine agents shown above:

```bash
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 --stop
```

### How do I create an experiment on the ClearML Server? <a name="from-scratch"></a>

* Integrate [ClearML](https://github.com/allegroai/clearml) with your code
* Execute the code on your machine (manually / PyCharm / Jupyter Notebook)
* As your code is running, **ClearML** creates an experiment logging all the necessary execution information:
    - Git repository link and commit ID (or an entire Jupyter notebook)
    - Git diff (we're not saying you never commit and push, but still...)
    - Python packages used by your code (including specific versions used)
    - Hyper-Parameters
    - Input Artifacts

You now have a 'template' of your experiment with everything required for automated execution

* In the ClearML UI, right-click the experiment and select 'clone'. A copy of your experiment will be created.
* You now have a new draft experiment cloned from your original experiment; feel free to edit it
    - Change the Hyper-Parameters
    - Switch to the latest code base of the repository
    - Update package versions
    - Select a specific docker image to run in (see docker execution mode section)
    - Or simply change nothing to run the same experiment again...
* Schedule the newly created experiment for execution: right-click the experiment and select 'enqueue'

### ClearML-Agent Services Mode <a name="services"></a>

ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs that
previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks)
for different use cases. To name a few use cases: an auto-scaler service (spinning up instances when the need arises and
the budget allows), controllers (implementing pipelines and more sophisticated DevOps logic), optimizers (such as
hyper-parameter optimization or sweeping), and applications (such as interactive Bokeh apps for increased data
transparency).

ClearML-Agent Services mode will spin up **any** task enqueued into the specified queue. Every task launched by
ClearML-Agent Services will be registered as a new node in the system, providing tracking and transparency capabilities.
Currently, clearml-agent in services mode supports a CPU-only configuration. ClearML-Agent services mode can be launched
alongside GPU agents.

```bash
clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
```

**Note**: It is the user's responsibility to make sure the proper tasks are pushed into the specified queue.

### Orchestration and Pipelines <a name="automl-pipes"></a>

The ClearML Agent can also be used to orchestrate and automate pipelines in conjunction with the ClearML package.

Sample automation examples can be found in the
ClearML [pipelines](https://github.com/allegroai/clearml/tree/master/examples/pipeline) / [automation](https://github.com/allegroai/clearml/tree/master/examples/automation) folders.

HPO examples:

- [Toy Keras training experiment](https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
    - In order to create an experiment-template in the system, this code must be executed once manually
- [Manual search over the above Keras experiment-template](https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py)
    - This example will create multiple copies of the Keras experiment-template, with different hyper-parameter
      combinations
- [Optimized Bayesian search over the above Keras experiment-template](https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py)
    - This example will create multiple copies of the Keras experiment-template with different hyper-parameter
      combinations, launch them on remote machines, monitor the metric (e.g. loss), decide which has the best potential,
      and abort the others

Experiment pipeline examples:

- [Build DAG from Tasks](https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_tasks.py)
    - This example builds a DAG processing flow from existing Tasks and launches them on remote machines
- [Logic Driven Pipeline](https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py)
    - This example runs any component function as a standalone Task on a remote machine; it auto-parallelizes jobs,
      caches results, and automatically serializes data between remote machines

### License

Apache License, Version 2.0 (see the [LICENSE](https://www.apache.org/licenses/LICENSE-2.0.html) for more information)