Documentation

This commit is contained in:
allegroai 2019-10-29 21:21:42 +02:00
parent b1d5b87bde
commit 1690132ff7


# TRAINS Agent
## Deep Learning DevOps For Everyone
"All the Deep-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"
[![GitHub license](https://img.shields.io/github/license/allegroai/trains-agent.svg)](https://github.com/allegroai/trains-agent/blob/master/LICENSE)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/trains-agent.svg)](https://pypi.org/project/trains-agent/)
TRAINS Agent is an AI experiment cluster solution.
It is a zero configuration fire-and-forget execution agent, which combined with trains-server provides a full AI cluster solution.
**Full AutoML in 5 steps**
1. Install the [TRAINS server](https://github.com/allegroai/trains-server) (or use our [open server](https://demoapp.trains.allegro.ai))
2. `pip install trains_agent` ([install](#installing-the-trains-agent) the TRAINS agent on any GPU machine: on-premises / cloud / ...)
3. Add [TRAINS](https://github.com/allegroai/trains) to your code with just 2 lines & run it once (on your machine / laptop)
4. Change the [parameters](#using-the-trains-agent) in the UI & schedule for [execution](#using-the-trains-agent) (or automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
5. :chart_with_downwards_trend: :chart_with_upwards_trend: :eyes: :beer:
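As a command-line sketch of steps 2 and 4 (the queue name `default` is illustrative, not prescribed by TRAINS):

```shell
# Step 2: install and configure the agent on a GPU machine
pip install trains_agent
trains-agent init        # interactive setup: server address + credentials
```

The two lines added in step 3 are the standard TRAINS integration: `from trains import Task` and `task = Task.init(project_name="examples", task_name="my experiment")` (names here are illustrative).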
**Using the TRAINS agent, you can now set up a dynamic cluster with \*epsilon DevOps**
*epsilon - Because we are scientists :triangular_ruler: and nothing is really zero work
(Experience TRAINS live at [https://demoapp.trains.allegro.ai](https://demoapp.trains.allegro.ai))
<a href="https://demoapp.trains.allegro.ai"><img src="https://raw.githubusercontent.com/allegroai/trains-agent/9f1e86c1ca45c984ee13edc9353c7b10c55d7257/docs/screenshots.gif" width="100%"></a>
* Easily add & remove machines from the cluster
* Reuse machines without the need for any dedicated containers or images
* **Combine GPU resources across any cloud and on-prem**
* **No need for yaml/json/template configuration of any kind**
* **User friendly UI**
* Manageable resource allocation that can be used by researchers and engineers
## But ... K8S?
We think Kubernetes is awesome.
Combined with KubeFlow it is a robust solution for production-grade DevOps.
We've observed, however, that it can be a bit of an overkill as an R&D DL/ML solution.
If you are considering K8S for your research, also consider that you will soon be managing **hundreds** of containers...
In our experience, handling and building the environments, having to package every experiment in a docker, managing those hundreds (or more) containers, and building pipelines on top of it all is very complicated (it's also usually out of scope for the research team, and overwhelming even for the DevOps team).
We feel there has to be a better way, one that can be just as powerful for R&D and at the same time allow integration with K8S **when the need arises**.
(If you already have a K8S cluster for AI, detailed instructions on how to integrate TRAINS into your K8S cluster are *coming soon*.)
## Using the TRAINS Agent
An experiment is scheduled for execution using the **'Enqueue'** action from the experiment
right-click context menu in the TRAINS UI and selecting the execution queue.
See [creating an experiment and enqueuing it for execution](#from-scratch).
Once an experiment is enqueued, it will be picked up and executed by a TRAINS agent monitoring this queue.
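For example, a machine running the command below becomes a worker servicing the (illustrative) `default` queue, and will pick up any experiment enqueued to it from the UI:

```shell
# Service the "default" queue, running each experiment in a virtual environment
trains-agent daemon --queue default

# Or run each experiment inside a docker container (docker execution mode)
trains-agent daemon --queue default --docker
```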
- Create a new virtual environment (or launch the selected docker image)
- Clone the code into the virtual-environment (or inside the docker)
- Install python packages based on the package requirements listed for the experiment
- Special note for PyTorch: The TRAINS agent will automatically select the
torch packages based on the CUDA_VERSION environment variable of the machine
- Execute the code, while monitoring the process
- Log all stdout/stderr in the TRAINS UI, including the cloning and installation process, for easy debugging
- Monitor the execution and allow you to manually abort the job using the TRAINS UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
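The same process can be driven for a single experiment without a daemon, which is handy for debugging; a sketch (the task id is a placeholder, copy the real one from the TRAINS UI, and the `CUDA_VERSION` value is illustrative):

```shell
# Optional: pin the CUDA version used to resolve torch packages
# (otherwise it is taken from the machine)
export CUDA_VERSION=10.0

# Run one experiment end-to-end: env creation, code clone, package install, execution
trains-agent execute --id <task-id>
```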
- Update package versions
- Select a specific docker image to run in (see docker execution mode section)
- Or simply change nothing to run the same experiment again...
* Schedule the newly created experiment for execution: Right-click the experiment and select 'enqueue'
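The enqueued clone is then served by whichever agent watches that queue; listing several queues gives them priority order (queue names below are illustrative):

```shell
# Pull from "important_jobs" first; fall back to "default" only when it is empty
trains-agent daemon --queue important_jobs default
```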
# AutoML and Orchestration Pipelines <a name="automl-pipes"></a>
The TRAINS Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the TRAINS package.
Sample AutoML & Orchestration examples can be found in the TRAINS [example/automl](https://github.com/allegroai/trains/tree/master/examples/automl) folder.