1. Install [trains-server](https://github.com/allegroai/trains-server) (or use our [open server](https://demoapp.trains.allegro.ai))
2. `pip install trains-agent` ([install](#installing-the-trains-agent) on any GPU machine: on-premises / cloud / ...)
3. Add 2 lines of [trains](https://github.com/allegroai/trains) to your code & run it once (on your machine / laptop)
4. Change the [parameters](#using-the-trains-agent) in the UI & send for [execution](#using-the-trains-agent) (or automate with an [AutoML pipeline](#automl-and-orchestration-pipelines-))
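For reference, setting up a worker machine (step 2) and having it serve a queue boils down to a few shell commands. This is only a sketch of the flow; each command is covered in detail in the sections below:

```bash
# install the agent and connect it to your trains-server (interactive wizard)
pip install trains-agent
trains-agent init

# start pulling and executing experiments from the "default" queue
trains-agent daemon --queue default
```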
We think Kubernetes is awesome. Combined with KubeFlow, it is a robust solution for production-grade DevOps.
However, we have observed that it can be overkill as an R&D DL/ML solution.
If you are considering K8S for your research, also consider that you will soon be managing **hundreds** of containers...
In our experience, packaging every experiment into a docker container, managing those hundreds (or more) containers, and building pipelines on top of it all quickly becomes very complicated. It is also usually out of scope for the research team, and overwhelming even for the DevOps team.
We feel there has to be a better way: one that is just as powerful for R&D and, at the same time, allows integration with K8S **when the need arises**. If you already have a K8S cluster for AI, detailed instructions on integrating TRAINS into your K8S cluster are *coming soon*.
- Install the required Python packages, resolving the torch packages according to the CUDA_VERSION environment of the machine
- Execute the code, while monitoring the process
- Log all stdout/stderr in the TRAINS UI, including the cloning and installation process, for easy debugging
- Monitor the execution and allow you to manually abort the job using the TRAINS UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
The full interface and capabilities are available with:
```bash
trains-agent --help
trains-agent daemon --help
```
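Besides serving a queue, the agent can also run a single experiment directly. A minimal sketch, assuming the `execute` sub-command is available in your installed version; `<task-id>` is a placeholder for an experiment ID copied from the TRAINS UI:

```bash
# set up the experiment's environment and execute it on this machine
trains-agent execute --id <task-id>
```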
### Configuring the TRAINS Agent

Connect the agent to your **trains-server** by running the setup wizard:
```bash
trains-agent init
```
Note: The TRAINS agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default TRAINS Agent cache folder is `~/.trains`.
See full details in your configuration file at `~/trains.conf`.
Note: The **TRAINS agent** extends the **TRAINS** configuration file `~/trains.conf`.
They are designed to share the same configuration file; see an example [here](docs/trains.conf).
### Running the TRAINS Agent
For debug and experimentation, start the TRAINS agent in `foreground` mode, where all the output is printed to screen
```bash
trains-agent daemon --queue default --foreground
```
For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe)
```bash
trains-agent daemon --queue default
```
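To dedicate specific GPUs to an agent, a GPU selection flag can be passed. This is a sketch assuming your installed version supports the `--gpus` flag listed in `trains-agent daemon --help`:

```bash
# serve the default queue using only GPU 0 on this machine
trains-agent daemon --queue default --gpus 0
```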
#### Starting the TRAINS Agent in docker mode
In docker mode the agent executes each experiment inside a docker container. For debug and experimentation, start the TRAINS agent in `foreground` mode, where all the output is printed to screen.
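A minimal sketch, assuming the `--docker` flag is how docker mode is selected in your installed version:

```bash
# run experiments from the default queue inside docker containers,
# printing all output to the console
trains-agent daemon --queue default --docker --foreground
```

Dropping `--foreground` switches to the same service mode described above.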
#### Starting the TRAINS Agent - priority queues

Priority queues are also supported: the **TRAINS agent** will first try to pull jobs from a high-priority queue (e.g. `important_jobs`), and only then fetch a job from the `default` queue.
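For example, to serve both queues with a single agent (a sketch assuming the `important_jobs` and `default` queues already exist on your trains-server; queues listed after `--queue` are assumed to be polled in order):

```bash
# poll important_jobs first, fall back to default when it is empty
trains-agent daemon --queue important_jobs default
```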
# AutoML and Orchestration Pipelines <a name="automl-pipes"></a>
The TRAINS Agent can also implement AutoML orchestration and Experiment Pipelines in conjunction with the TRAINS package.
Sample AutoML & Orchestration examples can be found in the TRAINS [example/automl](https://github.com/allegroai/trains/tree/master/examples/automl) folder.
AutoML examples:
- [Toy Keras training experiment](https://github.com/allegroai/trains/blob/master/examples/automl/automl_base_template_keras_simple.py)
- [Random Search over the above Keras experiment-template](https://github.com/allegroai/trains/blob/master/examples/automl/automl_random_search_example.py)
  - This example creates multiple copies of the Keras experiment-template, with different hyper-parameter combinations