# Allegro Trains Agent
## Deep Learning DevOps For Everyone - Now supporting all platforms (Linux, macOS, and Windows)
"All the Deep-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"
[![GitHub license](https://img.shields.io/github/license/allegroai/trains-agent.svg)](https://img.shields.io/github/license/allegroai/trains-agent.svg)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/trains-agent.svg)](https://img.shields.io/pypi/pyversions/trains-agent.svg)
[![PyPI version shields.io](https://img.shields.io/pypi/v/trains-agent.svg)](https://img.shields.io/pypi/v/trains-agent.svg)
[![PyPI status](https://img.shields.io/pypi/status/trains-agent.svg)](https://pypi.python.org/pypi/trains-agent/)
### Help improve Trains by filling out our 2-min [user survey](https://allegro.ai/lp/trains-user-survey/)
**Trains Agent is an AI experiment cluster solution.**
It is a zero-configuration, fire-and-forget execution agent which, combined with trains-server, provides a full AI cluster solution.
**Full AutoML in 5 steps**
1. Install the [Trains Server](https://github.com/allegroai/trains-server) (or use our [open server](https://demoapp.trains.allegro.ai))
2. `pip install trains-agent` ([install](#installing-the-trains-agent) the Trains Agent on any GPU machine: on-premises / cloud / ...)
3. Add [Trains](https://github.com/allegroai/trains) to your code with just 2 lines & run it once (on your machine / laptop); see the minimal snippet right after this list
4. Change the [parameters](#using-the-trains-agent) in the UI & schedule for [execution](#using-the-trains-agent) (or automate with an [AutoML pipeline](#automl-pipes))
5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:
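
The two-line integration in step 3 looks roughly like this (a minimal sketch; the project and experiment names below are placeholders):

```python
# Minimal Trains integration: these two lines create the experiment record
# that an agent can later clone and re-run (names below are examples).
from trains import Task

task = Task.init(project_name='examples', task_name='my first experiment')
# ... the rest of your training code remains unchanged ...
```
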
**Using the Trains Agent, you can now set up a dynamic cluster with \*epsilon DevOps**
*epsilon - Because we are scientists :triangular_ruler: and nothing is really zero work
(Experience Trains live at [https://demoapp.trains.allegro.ai](https://demoapp.trains.allegro.ai))
<a href="https://demoapp.trains.allegro.ai"><img src="https://raw.githubusercontent.com/allegroai/trains-agent/9f1e86c1ca45c984ee13edc9353c7b10c55d7257/docs/screenshots.gif" width="100%"></a>
## Simple, Flexible Experiment Orchestration
**The Trains Agent was built to address the DevOps needs of DL/ML R&D:**
* Easily add & remove machines from the cluster
* Reuse machines without the need for any dedicated containers or images
* **Combine GPU resources across any cloud and on-prem**
* **No need for yaml/json/template configuration of any kind**
* **User friendly UI**
* Manageable resource allocation that can be used by researchers and engineers
* Flexible and controllable scheduler with priority support
* Automatic instance spinning in the cloud **(coming soon)**
## But ... K8S?
We think Kubernetes is awesome.
Combined with KubeFlow it is a robust solution for production-grade DevOps.
We've observed, however, that it can be a bit of overkill as an R&D DL/ML solution.
If you are considering K8S for your research, also consider that you will soon be managing **hundreds** of containers...
In our experience, building and handling the environments, packaging every experiment in a docker container, managing those hundreds (or more) containers, and building pipelines on top of it all is very complicated (it's also usually out of scope for the research team, and overwhelming even for the DevOps team).
We feel there has to be a better way: one that is just as powerful for R&D and, at the same time, allows integration with K8S **when the need arises**.
(If you already have a K8S cluster for AI, detailed instructions on how to integrate Trains into your K8S cluster are available [here](https://github.com/allegroai/trains-server-k8s/tree/master/trains-server-chart), along with a [helm chart](https://github.com/allegroai/trains-server-helm))
## Using the Trains Agent
**Full scale HPC with a click of a button**
The Trains Agent is a job scheduler that listens on job queue(s), pulls jobs, sets up the job environment, executes the job, and monitors its progress.
Any 'Draft' experiment can be scheduled for execution by a Trains agent.
A previously run experiment can be put into 'Draft' state by either of two methods:
* Using the **'Reset'** action from the experiment right-click context menu in the
Trains UI - This will clear any results and artifacts the previous run had created.
* Using the **'Clone'** action from the experiment right-click context menu in the
Trains UI - This will create a new 'Draft' experiment with the same configuration as the original experiment.
An experiment is scheduled for execution using the **'Enqueue'** action from the experiment
right-click context menu in the Trains UI and selecting the execution queue.
See [creating an experiment and enqueuing it for execution](#from-scratch).
Once an experiment is enqueued, it will be picked up and executed by a Trains agent monitoring this queue.
The Trains UI Workers & Queues page provides ongoing execution information:
- Workers Tab: Monitor your cluster
- Review available resources
- Monitor machines statistics (CPU / GPU / Disk / Network)
- Queues Tab:
- Control the scheduling order of jobs
- Cancel or abort job execution
- Move jobs between execution queues
### What The Trains Agent Actually Does
The Trains Agent executes experiments using the following process:
- Create a new virtual environment (or launch the selected docker image)
- Clone the code into the virtual-environment (or inside the docker)
- Install python packages based on the package requirements listed for the experiment
- Special note for PyTorch: The Trains Agent will automatically select the
torch packages based on the CUDA_VERSION environment variable of the machine
- Execute the code, while monitoring the process
- Log all stdout/stderr in the Trains UI, including the cloning and installation process, for easy debugging
- Monitor the execution and allow you to manually abort the job using the Trains UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
### System Design & Flow
```text
+-----------------+
| GPU Machine |
Development Machine | |
+------------------------+ | +-------------+ |
| Data Scientist's | +--------------+ | |Trains Agent | |
| DL/ML Code | | WEB UI | | | | |
| | | | | | +---------+ | |
| | | | | | | DL/ML | | |
| | +--------------+ | | | Code | | |
| | User Clones Exp #1 / . . . . . . . / | | | | | |
| +-------------------+ | into Exp #2 / . . . . . . . / | | +---------+ | |
| | Trains | | +---------------/-_____________-/ | | | |
| +---------+---------+ | | | | ^ | |
+-----------|------------+ | | +------|------+ |
| | +--------|--------+
Auto-Magically | |
Creates Exp #1 | The Trains Agent
\ User Change Hyper-Parameters Pulls Exp #2, setup the
| | environment & clone code.
| | Start execution with the
+------------|------------+ | +--------------------+ new set of Hyper-Parameters.
| +---------v---------+ | | | Trains Server | |
| | Experiment #1 | | | | | |
| +-------------------+ | | | Execution Queue | |
| || | | | | |
| +-------------------+<----------+ | | |
| | | | | | |
| | Experiment #2 | | | | |
| +-------------------<------------\ | | |
| | ------------->---------------+ | |
| | User Send Exp #2 | |Execute Exp #2 +--------------------+
| | For Execution | +---------------+ |
| Trains Server | | |
+-------------------------+ +--------------------+
```
### Installing the Trains Agent
```bash
pip install trains-agent
```
### Trains Agent Usage Examples
The full interface and capabilities are available with:
```bash
trains-agent --help
trains-agent daemon --help
```
### Configuring the Trains Agent
```bash
trains-agent init
```
Note: The Trains Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default Trains Agent cache folder is `~/.trains`
See full details in your configuration file at `~/trains.conf`
Note: The **Trains agent** extends the **Trains** configuration file `~/trains.conf`
They are designed to share the same configuration file; see the example [here](docs/trains.conf)
### Running the Trains Agent
For debugging and experimentation, start the Trains Agent in `foreground` mode, where all output is printed to the screen:
```bash
trains-agent daemon --queue default --foreground
```
For actual service mode, all stdout will be stored automatically in a temporary file (no need to pipe).
Notice: with the `--detached` flag, *trains-agent* will run in the background
```bash
trains-agent daemon --detached --queue default
```
GPU allocation is controlled via the standard OS environment variable `NVIDIA_VISIBLE_DEVICES` or the `--gpus` flag (or disabled with `--cpu-only`).
If no flag is set and the `NVIDIA_VISIBLE_DEVICES` variable doesn't exist, all GPUs will be allocated for the `trains-agent` <br>
If the `--cpu-only` flag is set, or `NVIDIA_VISIBLE_DEVICES` is an empty string (""), no GPU will be allocated for the `trains-agent`
Example: spin two agents, one per GPU, on the same machine:
Notice: with the `--detached` flag, *trains-agent* will run in the background
```bash
trains-agent daemon --detached --gpus 0 --queue default
trains-agent daemon --detached --gpus 1 --queue default
```
Example: spin two agents, pulling from the dedicated `dual_gpu` queue, two GPUs per agent:
```bash
trains-agent daemon --detached --gpus 0,1 --queue dual_gpu
trains-agent daemon --detached --gpus 2,3 --queue dual_gpu
```
#### Starting the Trains Agent in docker mode
For debugging and experimentation, start the Trains Agent in `foreground` mode, where all output is printed to the screen:
```bash
trains-agent daemon --queue default --docker --foreground
```
For actual service mode, all stdout will be stored automatically in a file (no need to pipe).
Notice: with the `--detached` flag, *trains-agent* will run in the background
```bash
trains-agent daemon --detached --queue default --docker
```
Example: spin two agents, one per GPU, on the same machine, with the default `nvidia/cuda` docker image:
```bash
trains-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda
trains-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda
```
Example: spin two agents, pulling from the dedicated `dual_gpu` queue, two GPUs per agent, with the default `nvidia/cuda` docker image:
```bash
trains-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda
trains-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda
```
#### Starting the Trains Agent - Priority Queues
Priority queues are also supported. Example use case:
High priority queue: `important_jobs`, low priority queue: `default`
```bash
trains-agent daemon --queue important_jobs default
```
The **Trains Agent** will first try to pull jobs from the `important_jobs` queue, and only then will it fetch a job from the `default` queue.
Adding queues, managing job order within a queue, and moving jobs between queues are all available using the Web UI; see the example on our [open server](https://demoapp.trains.allegro.ai/workers-and-queues/queues)
## How do I create an experiment on the Trains Server? <a name="from-scratch"></a>
* Integrate [Trains](https://github.com/allegroai/trains) with your code
* Execute the code on your machine (Manually / PyCharm / Jupyter Notebook)
* As your code is running, **Trains** creates an experiment logging all the necessary execution information:
- Git repository link and commit ID (or an entire jupyter notebook)
- Git diff (we're not saying you never commit and push, but still...)
- Python packages used by your code (including specific versions used)
- Hyper-Parameters
- Input Artifacts
You now have a 'template' of your experiment with everything required for automated execution
* In the Trains UI, right-click the experiment and select 'Clone'. A copy of your experiment will be created.
* You now have a new draft experiment cloned from your original experiment; feel free to edit it:
- Change the Hyper-Parameters
- Switch to the latest code base of the repository
- Update package versions
- Select a specific docker image to run in (see docker execution mode section)
- Or simply change nothing to run the same experiment again...
* Schedule the newly created experiment for execution: right-click the experiment and select 'Enqueue' (a programmatic equivalent is sketched below)
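
The same clone-and-enqueue flow can also be driven from code with the Trains SDK. The sketch below is a minimal illustration; the project, experiment, and queue names are placeholders for your own setup:

```python
# Clone an existing experiment and schedule the clone for execution by an agent
# (project/task/queue names below are placeholders).
from trains import Task

# locate the experiment created by running your code once
template = Task.get_task(project_name='examples', task_name='my first experiment')

# clone it into a new draft experiment
cloned = Task.clone(source_task=template, name='my first experiment (clone)')

# enqueue the draft; an agent listening on the 'default' queue will pick it up
Task.enqueue(cloned.id, queue_name='default')
```
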
## Trains-Agent Services Mode <a name="services"></a>
Trains-Agent Services is a special mode of Trains-Agent that provides the ability to launch long-lasting jobs that previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks) for different use cases, for example: an auto-scaler service (spinning up instances when the need arises and the budget allows), controllers (implementing pipelines and more sophisticated DevOps logic), optimizers (such as hyper-parameter optimization or sweeping), and applications (such as interactive Bokeh apps for increased data transparency).

Trains-Agent Services mode will spin **any** task enqueued into the specified queue.
Every task launched by Trains-Agent Services will be registered as a new node in the system,
providing tracking and transparency capabilities.
Currently, trains-agent in services mode supports CPU-only configurations. Trains-agent services mode can be launched alongside GPU agents.
```bash
trains-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
```
**Note**: It is the user's responsibility to make sure the proper tasks are pushed into the specified queue.
## AutoML and Orchestration Pipelines <a name="automl-pipes"></a>
The Trains Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the Trains package.
Sample AutoML & Orchestration examples can be found in the Trains [example/automation](https://github.com/allegroai/trains/tree/master/examples/automation) folder.
AutoML examples
- [Toy Keras training experiment](https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/base_template_keras_simple.py)
- In order to create an experiment-template in the system, this code must be executed once manually
- [Random Search over the above Keras experiment-template](https://github.com/allegroai/trains/blob/master/examples/automation/manual_random_param_search_example.py)
- This example will create multiple copies of the Keras experiment-template, with different hyper-parameter combinations (the core pattern is sketched after this list)
Experiment Pipeline examples
- [First step experiment](https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py)
- This example will "process data", and once done, will launch a copy of the 'second step' experiment-template
- [Second step experiment](https://github.com/allegroai/trains/blob/master/examples/automation/toy_base_task.py)
- In order to create an experiment-template in the system, this code must be executed once manually
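
The random-search example above follows roughly the pattern below. This is a hedged sketch: the template name, the `batch_size` hyper-parameter key, and the `default` queue are illustrative and depend on your experiment-template:

```python
# Clone / override / enqueue loop for a simple random parameter search.
# The 'batch_size' key and the queue name are illustrative placeholders.
from random import choice
from trains import Task

# the experiment-template created by running the base experiment once
template = Task.get_task(project_name='examples', task_name='Keras HP optimization base')

for i in range(5):
    cloned = Task.clone(source_task=template, name='{} run {}'.format(template.name, i))
    params = cloned.get_parameters()
    params['batch_size'] = choice([64, 96, 128])  # hypothetical hyper-parameter key
    cloned.set_parameters(params)
    # an agent listening on this queue will execute each cloned experiment
    Task.enqueue(cloned.id, queue_name='default')
```
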
## License
Apache License, Version 2.0 (see the [LICENSE](https://www.apache.org/licenses/LICENSE-2.0.html) for more information)