Mirror of https://github.com/clearml/clearml-agent (synced 2025-03-13 15:08:44 +00:00)
Commit 1736d205bb (parent 6fef58df6c): Documentation
88
README.md
@@ -1,4 +1,4 @@
|
|||||||
# TRAINS Agent
|
# Allegro Trains Agent
|
||||||
## Deep Learning DevOps For Everyone - Now supporting all platforms (Linux, macOS, and Windows)
|
## Deep Learning DevOps For Everyone - Now supporting all platforms (Linux, macOS, and Windows)
|
||||||
|
|
||||||
"All the Deep-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"
|
"All the Deep-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"
|
||||||
@@ -10,27 +10,27 @@
|
|||||||
|
|
||||||
### Help improve Trains by filling our 2-min [user survey](https://allegro.ai/lp/trains-user-survey/)
|
### Help improve Trains by filling our 2-min [user survey](https://allegro.ai/lp/trains-user-survey/)
|
||||||
|
|
||||||
**TRAINS Agent is an AI experiment cluster solution.**
|
**Trains Agent is an AI experiment cluster solution.**
|
||||||
|
|
||||||
It is a zero configuration fire-and-forget execution agent, which combined with trains-server provides a full AI cluster solution.
|
It is a zero configuration fire-and-forget execution agent, which combined with trains-server provides a full AI cluster solution.
|
||||||
|
|
||||||
**Full AutoML in 5 steps**
|
**Full AutoML in 5 steps**
|
||||||
1. Install the [TRAINS server](https://github.com/allegroai/trains-agent) (or use our [open server](https://demoapp.trains.allegro.ai))
|
1. Install the [Trains Server](https://github.com/allegroai/trains-agent) (or use our [open server](https://demoapp.trains.allegro.ai))
|
||||||
2. `pip install trains-agent` ([install](#installing-the-trains-agent) the TRAINS agent on any GPU machine: on-premises / cloud / ...)
|
2. `pip install trains-agent` ([install](#installing-the-trains-agent) the Trains Agent on any GPU machine: on-premises / cloud / ...)
|
||||||
3. Add [TRAINS](https://github.com/allegroai/trains) to your code with just 2 lines & run it once (on your machine / laptop)
|
3. Add [Trains](https://github.com/allegroai/trains) to your code with just 2 lines & run it once (on your machine / laptop)
|
||||||
4. Change the [parameters](#using-the-trains-agent) in the UI & schedule for [execution](#using-the-trains-agent) (or automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
|
4. Change the [parameters](#using-the-trains-agent) in the UI & schedule for [execution](#using-the-trains-agent) (or automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
|
||||||
5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:
|
5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:
|
||||||
|
|
||||||
|
|
||||||
**Using the TRAINS agent, you can now set up a dynamic cluster with \*epsilon DevOps**
|
**Using the Trains Agent, you can now set up a dynamic cluster with \*epsilon DevOps**
|
||||||
|
|
||||||
*epsilon - Because we are scientists :triangular_ruler: and nothing is really zero work
|
*epsilon - Because we are scientists :triangular_ruler: and nothing is really zero work
|
||||||
|
|
||||||
(Experience TRAINS live at [https://demoapp.trains.allegro.ai](https://demoapp.trains.allegro.ai))
|
(Experience Trains live at [https://demoapp.trains.allegro.ai](https://demoapp.trains.allegro.ai))
|
||||||
<a href="https://demoapp.trains.allegro.ai"><img src="https://raw.githubusercontent.com/allegroai/trains-agent/9f1e86c1ca45c984ee13edc9353c7b10c55d7257/docs/screenshots.gif" width="100%"></a>
|
<a href="https://demoapp.trains.allegro.ai"><img src="https://raw.githubusercontent.com/allegroai/trains-agent/9f1e86c1ca45c984ee13edc9353c7b10c55d7257/docs/screenshots.gif" width="100%"></a>
|
||||||
|
|
||||||
## Simple, Flexible Experiment Orchestration
|
## Simple, Flexible Experiment Orchestration
|
||||||
**The TRAINS Agent was built to address the DL/ML R&D DevOps needs:**
|
**The Trains Agent was built to address the DL/ML R&D DevOps needs:**
|
||||||
|
|
||||||
* Easily add & remove machines from the cluster
|
* Easily add & remove machines from the cluster
|
||||||
* Reuse machines without the need for any dedicated containers or images
|
* Reuse machines without the need for any dedicated containers or images
|
||||||
@@ -51,30 +51,30 @@ If you are considering K8S for your research, also consider that you will soon b
|
|||||||
In our experience, handling and building the environments, having to package every experiment in a docker, managing those hundreds (or more) containers and building pipelines on top of it all, is very complicated (also, it’s usually out of scope for the research team, and overwhelming even for the DevOps team).
|
In our experience, handling and building the environments, having to package every experiment in a docker, managing those hundreds (or more) containers and building pipelines on top of it all, is very complicated (also, it’s usually out of scope for the research team, and overwhelming even for the DevOps team).
|
||||||
|
|
||||||
We feel there has to be a better way, that can be just as powerful for R&D and at the same time allow integration with K8S **when the need arises**.
|
We feel there has to be a better way, that can be just as powerful for R&D and at the same time allow integration with K8S **when the need arises**.
|
||||||
(If you already have a K8S cluster for AI, detailed instructions on how to integrate TRAINS into your K8S cluster are [here](https://github.com/allegroai/trains-server-k8s/tree/master/trains-server-chart) with included [helm chart](https://github.com/allegroai/trains-server-helm))
|
(If you already have a K8S cluster for AI, detailed instructions on how to integrate Trains into your K8S cluster are [here](https://github.com/allegroai/trains-server-k8s/tree/master/trains-server-chart) with included [helm chart](https://github.com/allegroai/trains-server-helm))
|
||||||
|
|
||||||
|
|
||||||
## Using the TRAINS Agent
|
## Using the Trains Agent
|
||||||
**Full scale HPC with a click of a button**
|
**Full scale HPC with a click of a button**
|
||||||
|
|
||||||
TRAINS Agent is a job scheduler that listens on job queue(s), pulls jobs, sets the job environments, executes the job and monitors its progress.
|
The Trains Agent is a job scheduler that listens on job queue(s), pulls jobs, sets the job environments, executes the job and monitors its progress.
|
||||||
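The pull-and-execute cycle described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the agent's actual implementation: the queues are plain in-memory `deque` objects, and `pull_next_job`, `run_agent_once`, and the `"completed"` / `"failed"` labels are hypothetical names made up for this example.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple


def pull_next_job(
    queues: Dict[str, Deque[str]], listen_order: List[str]
) -> Optional[Tuple[str, str]]:
    """Return (queue_name, job_id) for the first waiting job, or None if every queue is empty."""
    for name in listen_order:
        pending = queues.get(name)
        if pending:  # skips unknown queues (None) and empty ones
            return name, pending.popleft()
    return None


def run_agent_once(
    queues: Dict[str, Deque[str]],
    listen_order: List[str],
    execute: Callable[[str], None],
) -> Optional[Tuple[str, str]]:
    """Pull a single job, execute it, and report how it ended (hypothetical status labels)."""
    pulled = pull_next_job(queues, listen_order)
    if pulled is None:
        return None  # nothing to do; a real agent would wait and poll again
    _queue_name, job_id = pulled
    try:
        execute(job_id)
        return job_id, "completed"
    except Exception:
        # A crash in the job is caught and reported instead of stopping the agent.
        return job_id, "failed"


if __name__ == "__main__":
    demo_queues = {"default": deque(["exp-1", "exp-2"])}
    print(run_agent_once(demo_queues, ["default"], execute=lambda job_id: None))
```

In this sketch the `execute` callback stands in for environment setup and running the experiment code, so the scheduling logic can be read on its own.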
|
|
||||||
Any 'Draft' experiment can be scheduled for execution by a TRAINS agent.
|
Any 'Draft' experiment can be scheduled for execution by a Trains agent.
|
||||||
|
|
||||||
A previously run experiment can be put into 'Draft' state by either of two methods:
|
A previously run experiment can be put into 'Draft' state by either of two methods:
|
||||||
* Using the **'Reset'** action from the experiment right-click context menu in the
|
* Using the **'Reset'** action from the experiment right-click context menu in the
|
||||||
TRAINS UI - This will clear any results and artifacts the previous run had created.
|
Trains UI - This will clear any results and artifacts the previous run had created.
|
||||||
* Using the **'Clone'** action from the experiment right-click context menu in the
|
* Using the **'Clone'** action from the experiment right-click context menu in the
|
||||||
TRAINS UI - This will create a new 'Draft' experiment with the same configuration as the original experiment.
|
Trains UI - This will create a new 'Draft' experiment with the same configuration as the original experiment.
|
||||||
|
|
||||||
An experiment is scheduled for execution using the **'Enqueue'** action from the experiment
|
An experiment is scheduled for execution using the **'Enqueue'** action from the experiment
|
||||||
right-click context menu in the TRAINS UI and selecting the execution queue.
|
right-click context menu in the Trains UI and selecting the execution queue.
|
||||||
|
|
||||||
See [creating an experiment and enqueuing it for execution](#from-scratch).
|
See [creating an experiment and enqueuing it for execution](#from-scratch).
|
||||||
|
|
||||||
Once an experiment is enqueued, it will be picked up and executed by a TRAINS agent monitoring this queue.
|
Once an experiment is enqueued, it will be picked up and executed by a Trains agent monitoring this queue.
|
||||||
|
|
||||||
The TRAINS UI Workers & Queues page provides ongoing execution information:
|
The Trains UI Workers & Queues page provides ongoing execution information:
|
||||||
- Workers Tab: Monitor your cluster
|
- Workers Tab: Monitor your cluster
|
||||||
- Review available resources
|
- Review available resources
|
||||||
- Monitor machines statistics (CPU / GPU / Disk / Network)
|
- Monitor machines statistics (CPU / GPU / Disk / Network)
|
||||||
@@ -83,16 +83,16 @@ The TRAINS UI Workers & Queues page provides ongoing execution information:
|
|||||||
- Cancel or abort job execution
|
- Cancel or abort job execution
|
||||||
- Move jobs between execution queues
|
- Move jobs between execution queues
|
||||||
|
|
||||||
### What The TRAINS Agent Actually Does
|
### What The Trains Agent Actually Does
|
||||||
The TRAINS agent executes experiments using the following process:
|
The Trains Agent executes experiments using the following process:
|
||||||
- Create a new virtual environment (or launch the selected docker image)
|
- Create a new virtual environment (or launch the selected docker image)
|
||||||
- Clone the code into the virtual-environment (or inside the docker)
|
- Clone the code into the virtual-environment (or inside the docker)
|
||||||
- Install python packages based on the package requirements listed for the experiment
|
- Install python packages based on the package requirements listed for the experiment
|
||||||
- Special note for PyTorch: The TRAINS agent will automatically select the
|
- Special note for PyTorch: The Trains Agent will automatically select the
|
||||||
torch packages based on the CUDA_VERSION environment variable of the machine
|
torch packages based on the CUDA_VERSION environment variable of the machine
|
||||||
- Execute the code, while monitoring the process
|
- Execute the code, while monitoring the process
|
||||||
- Log all stdout/stderr in the TRAINS UI, including the cloning and installation process, for easy debugging
|
- Log all stdout/stderr in the Trains UI, including the cloning and installation process, for easy debugging
|
||||||
- Monitor the execution and allow you to manually abort the job using the TRAINS UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
|
- Monitor the execution and allow you to manually abort the job using the Trains UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
|
||||||
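As a rough illustration of the virtual-environment path through the process above, the sketch below only builds the list of commands such a run might involve and does not execute anything. It is an assumption-laden example, not the agent's real code: `plan_execution`, its parameters, the `task/venv` and `task/code` paths, and the POSIX `bin/python` layout are all hypothetical choices made for this example.

```python
import shlex
import sys
from typing import List


def plan_execution(
    repo_url: str,
    commit: str,
    requirements: List[str],
    script: str,
    workdir: str = "task",
) -> List[List[str]]:
    """Build (but do not run) the commands for one experiment in a fresh virtual environment."""
    venv_dir = f"{workdir}/venv"
    code_dir = f"{workdir}/code"
    venv_python = f"{venv_dir}/bin/python"  # assumed POSIX layout; Windows uses Scripts\python.exe
    steps = [
        [sys.executable, "-m", "venv", venv_dir],         # create a new virtual environment
        ["git", "clone", repo_url, code_dir],             # clone the experiment's code
        ["git", "-C", code_dir, "checkout", commit],      # pin the recorded commit
    ]
    if requirements:
        # install the package requirements listed for the experiment
        steps.append([venv_python, "-m", "pip", "install", *requirements])
    steps.append([venv_python, f"{code_dir}/{script}"])   # execute the code
    return steps


if __name__ == "__main__":
    plan = plan_execution(
        "https://example.com/repo.git", "abc123", ["numpy==1.19.0"], "train.py"
    )
    for step in plan:
        print(shlex.join(step))
```

Each entry could be passed to `subprocess.run` with output capture to mimic the logging and monitoring steps; that part is left out to keep the sketch side-effect free.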
|
|
||||||
### System Design & Flow
|
### System Design & Flow
|
||||||
```text
|
```text
|
||||||
@@ -100,24 +100,24 @@ The TRAINS agent executes experiments using the following process:
|
|||||||
| GPU Machine |
|
| GPU Machine |
|
||||||
Development Machine | |
|
Development Machine | |
|
||||||
+------------------------+ | +-------------+ |
|
+------------------------+ | +-------------+ |
|
||||||
| Data Scientist's | +--------------+ | |TRAINS Agent | |
|
| Data Scientist's | +--------------+ | |Trains Agent | |
|
||||||
| DL/ML Code | | WEB UI | | | | |
|
| DL/ML Code | | WEB UI | | | | |
|
||||||
| | | | | | +---------+ | |
|
| | | | | | +---------+ | |
|
||||||
| | | | | | | DL/ML | | |
|
| | | | | | | DL/ML | | |
|
||||||
| | +--------------+ | | | Code | | |
|
| | +--------------+ | | | Code | | |
|
||||||
| | User Clones Exp #1 / . . . . . . . / | | | | | |
|
| | User Clones Exp #1 / . . . . . . . / | | | | | |
|
||||||
| +-------------------+ | into Exp #2 / . . . . . . . / | | +---------+ | |
|
| +-------------------+ | into Exp #2 / . . . . . . . / | | +---------+ | |
|
||||||
| | TRAINS | | +---------------/-_____________-/ | | | |
|
| | Trains | | +---------------/-_____________-/ | | | |
|
||||||
| +---------+---------+ | | | | ^ | |
|
| +---------+---------+ | | | | ^ | |
|
||||||
+-----------|------------+ | | +------|------+ |
|
+-----------|------------+ | | +------|------+ |
|
||||||
| | +--------|--------+
|
| | +--------|--------+
|
||||||
Auto-Magically | |
|
Auto-Magically | |
|
||||||
Creates Exp #1 | The TRAINS Agent
|
Creates Exp #1 | The Trains Agent
|
||||||
\ User Change Hyper-Parameters Pulls Exp #2, setup the
|
\ User Change Hyper-Parameters Pulls Exp #2, setup the
|
||||||
| | environment & clone code.
|
| | environment & clone code.
|
||||||
| | Start execution with the
|
| | Start execution with the
|
||||||
+------------|------------+ | +--------------------+ new set of Hyper-Parameters.
|
+------------|------------+ | +--------------------+ new set of Hyper-Parameters.
|
||||||
| +---------v---------+ | | | TRAINS-SERVER | |
|
| +---------v---------+ | | | Trains Server | |
|
||||||
| | Experiment #1 | | | | | |
|
| | Experiment #1 | | | | | |
|
||||||
| +-------------------+ | | | Execution Queue | |
|
| +-------------------+ | | | Execution Queue | |
|
||||||
| || | | | | |
|
| || | | | | |
|
||||||
@@ -128,17 +128,17 @@ Development Machine |
|
|||||||
| | ------------->---------------+ | |
|
| | ------------->---------------+ | |
|
||||||
| | User Send Exp #2 | |Execute Exp #2 +--------------------+
|
| | User Send Exp #2 | |Execute Exp #2 +--------------------+
|
||||||
| | For Execution | +---------------+ |
|
| | For Execution | +---------------+ |
|
||||||
| TRAINS-SERVER | | |
|
| Trains Server | | |
|
||||||
+-------------------------+ +--------------------+
|
+-------------------------+ +--------------------+
|
||||||
```
|
```
|
||||||
|
|
||||||
### Installing the TRAINS Agent
|
### Installing the Trains Agent
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
pip install trains-agent
|
pip install trains-agent
|
||||||
```
|
```
|
||||||
|
|
||||||
### TRAINS Agent Usage Examples
|
### Trains Agent Usage Examples
|
||||||
|
|
||||||
Full Interface and capabilities are available with
|
Full Interface and capabilities are available with
|
||||||
```bash
|
```bash
|
||||||
@@ -146,22 +146,22 @@ trains-agent --help
|
|||||||
trains-agent daemon --help
|
trains-agent daemon --help
|
||||||
```
|
```
|
||||||
|
|
||||||
### Configuring the TRAINS Agent
|
### Configuring the Trains Agent
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
trains-agent init
|
trains-agent init
|
||||||
```
|
```
|
||||||
|
|
||||||
Note: The TRAINS agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default TRAINS Agent cache folder is `~/.trains`
|
Note: The Trains Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default Trains Agent cache folder is `~/.trains`
|
||||||
|
|
||||||
See full details in your configuration file at `~/trains.conf`
|
See full details in your configuration file at `~/trains.conf`
|
||||||
|
|
||||||
Note: The **TRAINS agent** extends the **TRAINS** configuration file `~/trains.conf`
|
Note: The **Trains agent** extends the **Trains** configuration file `~/trains.conf`
|
||||||
They are designed to share the same configuration file, see example [here](docs/trains.conf)
|
They are designed to share the same configuration file, see example [here](docs/trains.conf)
|
||||||
|
|
||||||
### Running the TRAINS Agent
|
### Running the Trains Agent
|
||||||
|
|
||||||
For debug and experimentation, start the TRAINS agent in `foreground` mode, where all the output is printed to screen
|
For debug and experimentation, start the Trains agent in `foreground` mode, where all the output is printed to screen
|
||||||
```bash
|
```bash
|
||||||
trains-agent daemon --queue default --foreground
|
trains-agent daemon --queue default --foreground
|
||||||
```
|
```
|
||||||
@@ -190,9 +190,9 @@ trains-agent daemon --detached --gpus 0,1 --queue dual_gpu
|
|||||||
trains-agent daemon --detached --gpus 2,3 --queue dual_gpu
|
trains-agent daemon --detached --gpus 2,3 --queue dual_gpu
|
||||||
```
|
```
|
||||||
|
|
||||||
#### Starting the TRAINS Agent in docker mode
|
#### Starting the Trains Agent in docker mode
|
||||||
|
|
||||||
For debug and experimentation, start the TRAINS agent in `foreground` mode, where all the output is printed to screen
|
For debug and experimentation, start the Trains agent in `foreground` mode, where all the output is printed to screen
|
||||||
```bash
|
```bash
|
||||||
trains-agent daemon --queue default --docker --foreground
|
trains-agent daemon --queue default --docker --foreground
|
||||||
```
|
```
|
||||||
@@ -215,7 +215,7 @@ trains-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda
|
|||||||
trains-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda
|
trains-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda
|
||||||
```
|
```
|
||||||
|
|
||||||
#### Starting the TRAINS Agent - Priority Queues
|
#### Starting the Trains Agent - Priority Queues
|
||||||
|
|
||||||
Priority Queues are also supported, example use case:
|
Priority Queues are also supported, example use case:
|
||||||
|
|
||||||
@@ -223,14 +223,14 @@ High priority queue: `important_jobs` Low priority queue: `default`
|
|||||||
```bash
|
```bash
|
||||||
trains-agent daemon --queue important_jobs default
|
trains-agent daemon --queue important_jobs default
|
||||||
```
|
```
|
||||||
The **TRAINS agent** will first try to pull jobs from the `important_jobs` queue, only then it will fetch a job from the `default` queue.
|
The **Trains Agent** will first try to pull jobs from the `important_jobs` queue, only then it will fetch a job from the `default` queue.
|
||||||
|
|
||||||
Adding queues, managing job order within a queue and moving jobs between queues, is available using the Web UI, see example on our [open server](https://demoapp.trains.allegro.ai/workers-and-queues/queues)
|
Adding queues, managing job order within a queue and moving jobs between queues, is available using the Web UI, see example on our [open server](https://demoapp.trains.allegro.ai/workers-and-queues/queues)
|
||||||
|
|
||||||
## How do I create an experiment on the TRAINS server? <a name="from-scratch"></a>
|
## How do I create an experiment on the Trains Server? <a name="from-scratch"></a>
|
||||||
* Integrate [TRAINS](https://github.com/allegroai/trains) with your code
|
* Integrate [Trains](https://github.com/allegroai/trains) with your code
|
||||||
* Execute the code on your machine (Manually / PyCharm / Jupyter Notebook)
|
* Execute the code on your machine (Manually / PyCharm / Jupyter Notebook)
|
||||||
* As your code is running, **TRAINS** creates an experiment logging all the necessary execution information:
|
* As your code is running, **Trains** creates an experiment logging all the necessary execution information:
|
||||||
- Git repository link and commit ID (or an entire jupyter notebook)
|
- Git repository link and commit ID (or an entire jupyter notebook)
|
||||||
- Git diff (we’re not saying you never commit and push, but still...)
|
- Git diff (we’re not saying you never commit and push, but still...)
|
||||||
- Python packages used by your code (including specific versions used)
|
- Python packages used by your code (including specific versions used)
|
||||||
@@ -239,7 +239,7 @@ Adding queues, managing job order within a queue and moving jobs between queues,
|
|||||||
|
|
||||||
You now have a 'template' of your experiment with everything required for automated execution
|
You now have a 'template' of your experiment with everything required for automated execution
|
||||||
|
|
||||||
* In the TRAINS UI, Right click on the experiment and select 'clone'. A copy of your experiment will be created.
|
* In the Trains UI, Right click on the experiment and select 'clone'. A copy of your experiment will be created.
|
||||||
* You now have a new draft experiment cloned from your original experiment, feel free to edit it
|
* You now have a new draft experiment cloned from your original experiment, feel free to edit it
|
||||||
- Change the Hyper-Parameters
|
- Change the Hyper-Parameters
|
||||||
- Switch to the latest code base of the repository
|
- Switch to the latest code base of the repository
|
||||||
@@ -270,9 +270,9 @@ trains-agent daemon --services-mode --detached --queue services --create-queue -
|
|||||||
|
|
||||||
|
|
||||||
## AutoML and Orchestration Pipelines <a name="automl-pipes"></a>
|
## AutoML and Orchestration Pipelines <a name="automl-pipes"></a>
|
||||||
The TRAINS Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the TRAINS package.
|
The Trains Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the Trains package.
|
||||||
|
|
||||||
Sample AutoML & Orchestration examples can be found in the TRAINS [example/automl](https://github.com/allegroai/trains/tree/master/examples/automl) folder.
|
Sample AutoML & Orchestration examples can be found in the Trains [example/automl](https://github.com/allegroai/trains/tree/master/examples/automl) folder.
|
||||||
|
|
||||||
AutoML examples
|
AutoML examples
|
||||||
- [Toy Keras training experiment](https://github.com/allegroai/trains/blob/master/examples/automl/automl_base_template_keras_simple.py)
|
- [Toy Keras training experiment](https://github.com/allegroai/trains/blob/master/examples/automl/automl_base_template_keras_simple.py)
|
||||||
|