Edit README (#156)

pollfly committed on 2023-07-19 16:51:14 +03:00 (committed by GitHub)
parent 3838247716
commit 6b7ee12dc1


@@ -24,8 +24,7 @@ ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows**
* Launch-and-Forget service containers
* [Cloud autoscaling](https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler)
* [Customizable cleanup](https://clear.ml/docs/latest/docs/guides/services/cleanup_service)
* Advanced [pipeline building and execution](https://clear.ml/docs/latest/docs/guides/frameworks/pytorch/notebooks/table/tabular_training_pipeline)

It is a zero configuration fire-and-forget execution agent, providing a full ML/DL cluster solution.
@@ -35,8 +34,8 @@ It is a zero configuration fire-and-forget execution agent, providing a full ML/
   or [free tier hosting](https://app.clear.ml)
2. `pip install clearml-agent` ([install](#installing-the-clearml-agent) the ClearML Agent on any GPU machine:
   on-premises / cloud / ...)
3. Create a [job](https://clear.ml/docs/latest/docs/apps/clearml_task) or
   add [ClearML](https://github.com/allegroai/clearml) to your code with just 2 lines of code (see the `clearml-task` sketch after this list)
4. Change the [parameters](#using-the-clearml-agent) in the UI & schedule for [execution](#using-the-clearml-agent) (or
   automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:
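
For step 3, a minimal sketch using the `clearml-task` CLI (the project, task name, script path, and queue below are illustrative placeholders, not values taken from this README):

```bash
# Create a ClearML job from an existing repository script and enqueue it,
# so that any agent listening on the queue will pick it up and run it
clearml-task --project examples --name remote_test \
  --repo https://github.com/allegroai/clearml.git \
  --script examples/frameworks/pytorch/pytorch_mnist.py \
  --queue default
```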
@@ -81,21 +80,21 @@ Find Dockerfiles in the [docker](./docker) dir and a helm Chart in https://githu
**Two K8s integration flavours**

- Spin ClearML-Agent as a long-lasting service pod:
  - Use the [clearml-agent](https://hub.docker.com/r/allegroai/clearml-agent) docker image
  - Map the docker socket into the pod (soon to be replaced by [podman](https://github.com/containers/podman))
  - Allow the clearml-agent to manage sibling dockers
  - Benefits: full use of the ClearML scheduling; no need to worry about wrong container images, lost pods, etc.
  - Downside: sibling containers
- Kubernetes Glue, map ClearML jobs directly to K8s jobs:
  - Run the [clearml-k8s glue](https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py) on
    a K8s CPU node (see the sketch after this list)
  - The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on a provided
    YAML template)
  - Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the
    experiment's process
  - Benefits: full Kubernetes visibility of all running jobs in the system
  - Downside: no real scheduling (beyond the K8s scheduler), no docker image verification (post-mortem only)
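
A minimal sketch of launching the glue, assuming the example script accepts a `--queue` argument (the queue name is a placeholder):

```bash
# Run the k8s glue on a CPU node; it pulls tasks from the given ClearML queue
# and maps each one to a K8s job
python k8s_glue_example.py --queue k8s_scheduler
```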
### Using the ClearML Agent
@@ -110,15 +109,15 @@ A previously run experiment can be put into 'Draft' state by either of two metho
* Using the **'Reset'** action from the experiment right-click context menu in the ClearML UI - This will clear any
  results and artifacts the previous run had created.
* Using the **'Clone'** action from the experiment right-click context menu in the ClearML UI - This will create a new
  'Draft' experiment with the same configuration as the original experiment.

An experiment is scheduled for execution using the **'Enqueue'** action from the experiment right-click context menu in
the ClearML UI and selecting the execution queue.

See [creating an experiment and enqueuing it for execution](#from-scratch).

Once an experiment is enqueued, it will be picked up and executed by a ClearML Agent monitoring this queue.

The ClearML UI Workers & Queues page provides ongoing execution information:
@@ -170,22 +169,22 @@

```bash
clearml-agent init
```
Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default
ClearML Agent cache folder is `~/.clearml`.

See full details in your configuration file at `~/clearml.conf`.

Note: The **ClearML Agent** extends the **ClearML** configuration file `~/clearml.conf`.
They are designed to share the same configuration file; see the example [here](docs/clearml.conf).
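
If the shared file lives somewhere non-default, both tools honor the `CLEARML_CONFIG_FILE` environment variable; a minimal sketch (the path is a placeholder):

```bash
# Point the agent (and the ClearML SDK) at an alternate configuration file
CLEARML_CONFIG_FILE=/opt/clearml/clearml.conf clearml-agent daemon --queue default
```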
#### Running the ClearML Agent

For debug and experimentation, start the ClearML Agent in `foreground` mode, where all the output is printed to the screen:

```bash
clearml-agent daemon --queue default --foreground
```
For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe).

Notice: with the `--detached` flag, the *clearml-agent* will run in the background.

```bash
clearml-agent daemon --detached --queue default
```

@@ -195,20 +194,21 @@ clearml-agent daemon --detached --queue default
GPU allocation is controlled via the standard OS environment variable `NVIDIA_VISIBLE_DEVICES`, or the `--gpus` flag
(or disabled with `--cpu-only`).

If no flag is set, and the `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPUs will be allocated for
the `clearml-agent`. <br>
If the `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES="none"`, no GPU will be allocated for
the `clearml-agent`.
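
A minimal sketch of the two controls (queue names are placeholders):

```bash
# Restrict an agent to a single GPU via the environment variable
NVIDIA_VISIBLE_DEVICES=0 clearml-agent daemon --queue default

# Or allocate no GPU at all
clearml-agent daemon --cpu-only --queue cpu_queue
```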
Example: spin two agents, one per GPU on the same machine:

Notice: with the `--detached` flag, the *clearml-agent* will run in the background.
```bash
clearml-agent daemon --detached --gpus 0 --queue default
clearml-agent daemon --detached --gpus 1 --queue default
```
Example: spin two agents, pulling from a dedicated `dual_gpu` queue, two GPUs per agent:
```bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu
```
@@ -223,14 +223,14 @@ For debug and experimentation, start the ClearML agent in `foreground` mode, whe
```bash
clearml-agent daemon --queue default --docker --foreground
```
For actual service mode, all the stdout will be stored automatically into a file (no need to pipe).

Notice: with the `--detached` flag, the *clearml-agent* will run in the background.

```bash
clearml-agent daemon --detached --queue default --docker
```
Example: spin two agents, one per GPU on the same machine, with the default
`nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04` docker:

```bash
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
```
Example: spin two agents, pulling from a dedicated `dual_gpu` queue, two GPUs per agent, with the default
`nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04` docker:

```bash
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
```
Priority queues are also supported; example use case:

High priority queue: `important_jobs`, low priority queue: `default`

```bash
clearml-agent daemon --queue important_jobs default
```
The **ClearML Agent** will first try to pull jobs from the `important_jobs` queue, and only if it is empty will the
agent pull from the `default` queue.

Adding queues, managing job order within a queue, and moving jobs between queues are all available using the Web UI;
see the example on our [free server](https://app.clear.ml/workers-and-queues/queues).
##### Stopping the ClearML Agent
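
A minimal sketch, assuming the daemon's `--stop` flag is matched against the same arguments the agent was started with (queue name is a placeholder):

```bash
# Stop the agent that was spun on the 'default' queue
clearml-agent daemon --queue default --stop
```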
@@ -279,32 +279,33 @@ clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10
- Git repository link and commit ID (or an entire Jupyter notebook)
- Git diff (we're not saying you never commit and push, but still...)
- Python packages used by your code (including specific versions used)
- Hyperparameters
- Input artifacts

You now have a 'template' of your experiment with everything required for automated execution.

* In the ClearML UI, right-click on the experiment and select 'clone'. A copy of your experiment will be created.
* You now have a new draft experiment cloned from your original experiment; feel free to edit it:
  - Change the hyperparameters
  - Switch to the latest code base of the repository
  - Update package versions
  - Select a specific docker image to run in (see docker execution mode section)
  - Or simply change nothing to run the same experiment again...
* Schedule the newly created experiment for execution: right-click the experiment and select 'enqueue'
### ClearML-Agent Services Mode <a name="services"></a>
ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs that
previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks)
for different use cases:

* Auto-scaler service (spinning instances when the need arises and the budget allows)
* Controllers (implementing pipelines and more sophisticated DevOps logic)
* Optimizer (such as Hyperparameter Optimization or sweeping)
* Application (such as interactive Bokeh apps for increased data transparency)
ClearML-Agent Services mode will spin **any** task enqueued into the specified queue. Every task launched by
ClearML-Agent Services will be registered as a new node in the system, providing tracking and transparency capabilities.
Currently, clearml-agent in services mode supports CPU-only configurations. ClearML-Agent services mode can be launched
alongside GPU agents.

```bash
# A sketch of a services-mode launch (assumed flags; the docker image and
# queue name are placeholders, not values taken from this hunk)
clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
```

@@ -321,15 +322,15 @@ ClearML package.
Sample AutoML & Orchestration examples can be found in the
ClearML [example/automation](https://github.com/allegroai/clearml/tree/master/examples/automation) folder.

AutoML examples:

- [Toy Keras training experiment](https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
  - In order to create an experiment-template in the system, this code must be executed once manually
- [Random Search over the above Keras experiment-template](https://github.com/allegroai/clearml/blob/master/examples/automation/manual_random_param_search_example.py)
  - This example will create multiple copies of the Keras experiment-template, with different hyperparameter
    combinations

Experiment Pipeline examples:

- [First step experiment](https://github.com/allegroai/clearml/blob/master/examples/automation/task_piping_example.py)
  - This example will "process data", and once done, will launch a copy of the 'second step' experiment-template