# Allegro Trains Agent
## Deep Learning DevOps For Everyone - Now supporting all platforms (Linux, macOS, and Windows)
"All the Deep-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"
2019-10-29 01:34:40 +00:00
[![GitHub license ](https://img.shields.io/github/license/allegroai/trains-agent.svg )](https://img.shields.io/github/license/allegroai/trains-agent.svg)
[![PyPI pyversions ](https://img.shields.io/pypi/pyversions/trains-agent.svg )](https://img.shields.io/pypi/pyversions/trains-agent.svg)
[![PyPI version shields.io ](https://img.shields.io/pypi/v/trains-agent.svg )](https://img.shields.io/pypi/v/trains-agent.svg)
[![PyPI status ](https://img.shields.io/pypi/status/trains-agent.svg )](https://pypi.python.org/pypi/trains-agent/)
### Help improve Trains by filling out our 2-min [user survey](https://allegro.ai/lp/trains-user-survey/)
**Trains Agent is an AI experiment cluster solution.**

It is a zero-configuration, fire-and-forget execution agent that, combined with trains-server, provides a full AI cluster solution.

**Full AutoML in 5 steps**

1. Install the [Trains Server](https://github.com/allegroai/trains-server) (or use our [open server](https://demoapp.trains.allegro.ai))
2. `pip install trains-agent` ([install](#installing-the-trains-agent) the Trains Agent on any GPU machine: on-premises / cloud / ...)
3. Add [Trains](https://github.com/allegroai/trains) to your code with just 2 lines & run it once (on your machine / laptop); see the snippet below
4. Change the [parameters](#using-the-trains-agent) in the UI & schedule for [execution](#using-the-trains-agent) (or automate with an [AutoML pipeline](#automl-pipes))
5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:
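
A minimal sketch of the "2 lines" from step 3 (the project and task names below are just placeholders):

```python
from trains import Task

# The only two lines needed to integrate Trains - names below are placeholders
task = Task.init(project_name='examples', task_name='my first experiment')
```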
**Using the Trains Agent, you can now set up a dynamic cluster with \*epsilon DevOps**

*epsilon - Because we are scientists :triangular_ruler: and nothing is really zero work

(Experience Trains live at [https://demoapp.trains.allegro.ai](https://demoapp.trains.allegro.ai))

<a href="https://demoapp.trains.allegro.ai"><img src="https://raw.githubusercontent.com/allegroai/trains-agent/9f1e86c1ca45c984ee13edc9353c7b10c55d7257/docs/screenshots.gif" width="100%"></a>
## Simple, Flexible Experiment Orchestration
**The Trains Agent was built to address the DevOps needs of DL/ML R&D:**
* Easily add & remove machines from the cluster
* Reuse machines without the need for any dedicated containers or images
* **Combine GPU resources across any cloud and on-prem**
* **No need for yaml/json/template configuration of any kind**
* **User friendly UI**
* Manageable resource allocation that can be used by researchers and engineers
* Flexible and controllable scheduler with priority support
* Automatic instance spin-up in the cloud **(coming soon)**
## But ... K8S?
We think Kubernetes is awesome.
Combined with KubeFlow it is a robust solution for production-grade DevOps.
We've observed, however, that it can be overkill as a DL/ML R&D solution.

If you are considering K8S for your research, also consider that you will soon be managing **hundreds** of containers...

In our experience, building and maintaining environments, packaging every experiment in a docker container, managing those hundreds (or more) containers, and building pipelines on top of it all is very complicated (it is also usually out of scope for the research team, and overwhelming even for the DevOps team).

We feel there has to be a better way, one that is just as powerful for R&D and at the same time allows integration with K8S **when the need arises**.

(If you already have a K8S cluster for AI, detailed instructions for integrating Trains into your K8S cluster are available [here](https://github.com/allegroai/trains-server-k8s/tree/master/trains-server-chart), including a [helm chart](https://github.com/allegroai/trains-server-helm).)
## Using the Trains Agent
**Full-scale HPC with a click of a button**

The Trains Agent is a job scheduler that listens on job queue(s), pulls jobs, sets up the job environment, executes the job, and monitors its progress.

Any 'Draft' experiment can be scheduled for execution by a Trains agent.
A previously run experiment can be put into 'Draft' state by either of two methods:
* Using the **'Reset'** action from the experiment right-click context menu in the
  Trains UI - this clears any results and artifacts the previous run created.
* Using the **'Clone'** action from the experiment right-click context menu in the
  Trains UI - this creates a new 'Draft' experiment with the same configuration as the original experiment.

An experiment is scheduled for execution using the **'Enqueue'** action from the experiment
right-click context menu in the Trains UI and selecting the execution queue.

See [creating an experiment and enqueuing it for execution](#from-scratch).

Once an experiment is enqueued, it will be picked up and executed by a Trains agent monitoring this queue.

The Trains UI Workers & Queues page provides ongoing execution information:
- Workers Tab: Monitor your cluster
  - Review available resources
  - Monitor machine statistics (CPU / GPU / Disk / Network)
- Queues Tab:
- Control the scheduling order of jobs
- Cancel or abort job execution
- Move jobs between execution queues
### What The Trains Agent Actually Does
The Trains Agent executes experiments using the following process:
- Create a new virtual environment (or launch the selected docker image)
- Clone the code into the virtual environment (or inside the docker container)
- Install python packages based on the package requirements listed for the experiment
- Special note for PyTorch: The Trains Agent will automatically select the
torch packages based on the CUDA_VERSION environment variable of the machine
- Execute the code, while monitoring the process
- Log all stdout/stderr in the Trains UI, including the cloning and installation process, for easy debugging
- Monitor the execution and allow you to manually abort the job using the Trains UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
### System Design & Flow
```text
+-----------------+
| GPU Machine |
Development Machine | |
+------------------------+ | +-------------+ |
| Data Scientist's | +--------------+ | |Trains Agent | |
| DL/ML Code | | WEB UI | | | | |
| | | | | | +---------+ | |
| | | | | | | DL/ML | | |
| | +--------------+ | | | Code | | |
| | User Clones Exp #1 / . . . . . . . / | | | | | |
| +-------------------+ | into Exp #2 / . . . . . . . / | | +---------+ | |
| | Trains | | +---------------/-_____________-/ | | | |
| +---------+---------+ | | | | ^ | |
+-----------|------------+ | | +------|------+ |
| | +--------|--------+
Auto-Magically | |
Creates Exp #1 | The Trains Agent
\ User Change Hyper-Parameters Pulls Exp #2 , setup the
| | environment & clone code.
| | Start execution with the
+------------|------------+ | +--------------------+ new set of Hyper-Parameters.
| +---------v---------+ | | | Trains Server | |
| | Experiment #1 | | | | | |
| +-------------------+ | | | Execution Queue | |
| || | | | | |
| +-------------------+< ---------- + | | |
| | | | | | |
| | Experiment #2 | | | | |
| +-------------------< ------------ \ | | |
| | ------------->---------------+ | |
| | User Send Exp #2 | |Execute Exp #2 +--------------------+
| | For Execution | +---------------+ |
| Trains Server | | |
+-------------------------+ +--------------------+
```
### Installing the Trains Agent
```bash
pip install trains-agent
```
### Trains Agent Usage Examples
The full interface and capabilities are available with:
```bash
trains-agent --help
trains-agent daemon --help
```
### Configuring the Trains Agent
```bash
trains-agent init
```
Note: The Trains Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default Trains Agent cache folder is `~/.trains`.

See full details in your configuration file at `~/trains.conf`.

Note: The **Trains agent** extends the **Trains** configuration file `~/trains.conf`.
They are designed to share the same configuration file; see the example [here](docs/trains.conf).
### Running the Trains Agent
For debugging and experimentation, start the Trains agent in `foreground` mode, where all output is printed to the screen:
```bash
trains-agent daemon --queue default --foreground
```
For actual service mode, all stdout is automatically stored in a temporary file (no need to pipe).
Notice: with the `--detached` flag, the *trains-agent* will run in the background
```bash
trains-agent daemon --detached --queue default
```
GPU allocation is controlled via the standard OS environment variable `NVIDIA_VISIBLE_DEVICES` or the `--gpus` flag (or disabled with `--cpu-only`).
If no flag is set and the `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPUs will be allocated for the `trains-agent`.<br>
If the `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES` is an empty string (""), no GPU will be allocated for the `trains-agent`.

Example: spin two agents, one per GPU, on the same machine:
Notice: with the `--detached` flag, the *trains-agent* will run in the background
```bash
trains-agent daemon --detached --gpus 0 --queue default
trains-agent daemon --detached --gpus 1 --queue default
```
Example: spin two agents, pulling from a dedicated `dual_gpu` queue, two GPUs per agent:
```bash
trains-agent daemon --detached --gpus 0,1 --queue dual_gpu
trains-agent daemon --detached --gpus 2,3 --queue dual_gpu
```
#### Starting the Trains Agent in docker mode
For debugging and experimentation, start the Trains agent in `foreground` mode, where all output is printed to the screen:
```bash
trains-agent daemon --queue default --docker --foreground
```
For actual service mode, all stdout is automatically stored in a file (no need to pipe).
Notice: with the `--detached` flag, the *trains-agent* will run in the background
```bash
trains-agent daemon --detached --queue default --docker
```
Example: spin two agents, one per GPU, on the same machine, with the default `nvidia/cuda` docker image:
```bash
trains-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda
trains-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda
```
Example: spin two agents, pulling from a dedicated `dual_gpu` queue, two GPUs per agent, with the default `nvidia/cuda` docker image:
```bash
trains-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda
trains-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda
```
#### Starting the Trains Agent - Priority Queues
Priority queues are also supported. Example use case:

High priority queue: `important_jobs`; low priority queue: `default`
```bash
trains-agent daemon --queue important_jobs default
```
The **Trains Agent** will first try to pull jobs from the `important_jobs` queue, and only then fetch a job from the `default` queue.

Adding queues, managing job order within a queue, and moving jobs between queues are all available using the Web UI; see the example on our [open server](https://demoapp.trains.allegro.ai/workers-and-queues/queues).
## How do I create an experiment on the Trains Server? <a name="from-scratch"></a>
* Integrate [Trains](https://github.com/allegroai/trains) with your code
* Execute the code on your machine (Manually / PyCharm / Jupyter Notebook)
* As your code is running, **Trains** creates an experiment logging all the necessary execution information:
  - Git repository link and commit ID (or an entire jupyter notebook)
  - Git diff (we're not saying you never commit and push, but still...)
  - Python packages used by your code (including specific versions used)
  - Hyper-Parameters
  - Input Artifacts

  You now have a 'template' of your experiment with everything required for automated execution
* In the Trains UI, right-click the experiment and select 'Clone'. A copy of your experiment will be created.
* You now have a new draft experiment cloned from your original experiment; feel free to edit it (see the hyper-parameter snippet after this list):
- Change the Hyper-Parameters
- Switch to the latest code base of the repository
- Update package versions
- Select a specific docker image to run in (see docker execution mode section)
- Or simply change nothing to run the same experiment again...
* Schedule the newly created experiment for execution: right-click the experiment and select 'Enqueue'
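
For the hyper-parameters edited in the UI to be injected into your code at execution time, they have to be reported by the script in the first place. A minimal sketch using `task.connect()` (the parameter names and values below are placeholders):

```python
from trains import Task

task = Task.init(project_name='examples', task_name='hyper-parameter demo')

# Report the hyper-parameters; when a trains-agent runs a cloned copy of this
# experiment, the values edited in the UI are injected back into this dict
params = {'batch_size': 32, 'learning_rate': 0.001, 'epochs': 10}
params = task.connect(params)

print('Running with:', params)
```

Arguments parsed with `argparse` are captured and connected automatically, with no explicit `connect()` call needed.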
## Trains-Agent Services Mode <a name="services"></a>
Trains-Agent Services is a special mode of Trains-Agent for launching long-lasting jobs
that previously had to be executed on local / dedicated machines. It allows a single agent to
launch multiple dockers (Tasks) for different use cases, for example: an auto-scaler service (spinning up instances
when the need arises and the budget allows), controllers (implementing pipelines and more sophisticated DevOps logic),
optimizers (such as hyper-parameter optimization or sweeping), and applications (such as interactive Bokeh apps for
increased data transparency).

Trains-Agent Services mode will spin **any** task enqueued into the specified queue.
Every task launched by Trains-Agent Services will be registered as a new node in the system,
providing tracking and transparency capabilities.
Currently, trains-agent in services mode supports CPU-only configurations. Trains-agent services mode can be launched alongside GPU agents.
```bash
trains-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
```
**Note**: It is the user's responsibility to make sure the proper tasks are pushed into the specified queue.
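
Any regular Trains task can act as such a service. As an illustration only, a toy long-running task (the project and task names are placeholders) that could be enqueued into the `services` queue:

```python
import time

from trains import Task

# A toy long-lasting 'service' - replace the loop body with real
# controller / monitoring / auto-scaling logic
task = Task.init(project_name='DevOps', task_name='toy heartbeat service')

while True:
    print('service heartbeat')
    time.sleep(60)
```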
## AutoML and Orchestration Pipelines <a name="automl-pipes"></a>
The Trains Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the Trains package.
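
At its core this boils down to programmatically cloning a template experiment, editing it, and enqueuing it for a Trains agent to execute. A rough sketch of the idea (the task ID, parameter name, and queue name below are placeholders; see the maintained examples linked below for complete versions):

```python
from trains import Task

# A previously executed experiment that serves as the template (placeholder ID)
template_task = Task.get_task(task_id='<template-task-id>')

for lr in (0.1, 0.01, 0.001):
    # Create a new 'Draft' experiment from the template
    cloned = Task.clone(source_task=template_task, name='auto lr={}'.format(lr))

    # Override a hyper-parameter (the parameter name is a placeholder)
    params = cloned.get_parameters()
    params['learning_rate'] = lr
    cloned.set_parameters(params)

    # Enqueue it; a trains-agent listening on this queue will execute it
    Task.enqueue(cloned, queue_name='default')
```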
Sample AutoML & Orchestration examples can be found in the Trains [examples/automation](https://github.com/allegroai/trains/tree/master/examples/automation) folder.

AutoML examples:
- [Toy Keras training experiment](https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
  - In order to create an experiment-template in the system, this code must be executed once manually
- [Random Search over the above Keras experiment-template](https://github.com/allegroai/trains/blob/master/examples/automation/manual_random_param_search_example.py)
  - This example will create multiple copies of the Keras experiment-template, with different hyper-parameter combinations

Experiment Pipeline examples:
- [First step experiment](https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py)
  - This example will "process data", and once done, will launch a copy of the 'second step' experiment-template
- [Second step experiment](https://github.com/allegroai/trains/blob/master/examples/automation/toy_base_task.py)
  - In order to create an experiment-template in the system, this code must be executed once manually
## License
Apache License, Version 2.0 (see the [LICENSE](https://www.apache.org/licenses/LICENSE-2.0.html) for more information)