# TRAINS Server
## Auto-Magical Experiment Manager & Version Control for AI
[![GitHub license](https://img.shields.io/badge/license-SSPL-green.svg)](https://img.shields.io/badge/license-SSPL-green.svg)
[![Python versions](https://img.shields.io/badge/python-3.6%20%7C%203.7-blue.svg)](https://img.shields.io/badge/python-3.6%20%7C%203.7-blue.svg)
[![GitHub version](https://img.shields.io/github/release-pre/allegroai/trains-server.svg)](https://img.shields.io/github/release-pre/allegroai/trains-server.svg)
[![PyPI status](https://img.shields.io/badge/status-beta-yellow.svg)](https://img.shields.io/badge/status-beta-yellow.svg)
## Introduction
The **trains-server** is the backend service infrastructure for [TRAINS](https://github.com/allegroai/trains).
It allows multiple users to collaborate and manage their experiments.
By default, TRAINS is set up to work with the TRAINS demo server, which is open to anyone and resets periodically.
To host your own server, you will need to install **trains-server** and point TRAINS to it.

**trains-server** contains the following components:

* The TRAINS Web-App, a single-page UI for experiment management and browsing
* RESTful API for:
    * Documenting and logging experiment information, statistics and results
    * Querying experiment history, logs and results
* Locally-hosted file server for storing images and models, making them easily accessible using the Web-App

You can quickly set up your **trains-server** using a pre-built Docker image (see [Installation](#installation)).

When new releases are available, you can upgrade your pre-built Docker image (see [Upgrade](#upgrade)).

## System diagram
![Alt Text](https://github.com/allegroai/trains/blob/master/docs/system_diagram.png?raw=true)
## Install / Upgrade - AWS <a name="aws"></a>
Use one of our pre-installed Amazon Machine Images for easy deployment in AWS.
For details and instructions, see [TRAINS-server: AWS pre-installed images](docs/install_aws.md).
## Install - Linux, Mac OS X <a name="installation"></a>
Use our pre-built Docker image for easy deployment in Linux and Mac OS X.
For Windows, we recommend installing our pre-built Docker image on a Linux virtual machine.
1. Set up Docker (for full details, see [Setup Docker Service](docs/docker_setup.md))

    Make sure ports 8080, 8081 and 8008 are available for the `trains-server` services.

    Increase `vm.max_map_count` for the `ElasticSearch` Docker container:

    ```bash
    echo "vm.max_map_count=262144" > /tmp/99-trains.conf
    sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf
    sudo sysctl -w vm.max_map_count=262144
    sudo service docker restart
    ```
1. Create local directories for the databases and storage.

    ```bash
    sudo mkdir -p /opt/trains/data/elastic
    sudo mkdir -p /opt/trains/data/mongo/db
    sudo mkdir -p /opt/trains/data/mongo/configdb
    sudo mkdir -p /opt/trains/logs
    sudo mkdir -p /opt/trains/data/fileserver
    ```

    Then set the ownership of the new directories. On Linux:

    ```bash
    sudo chown -R 1000:1000 /opt/trains
    ```

    On Mac OS X:

    ```bash
    sudo chown -R $(whoami):staff /opt/trains
    ```
1. Clone the [trains-server](https://github.com/allegroai/trains-server) repository and change to the new **trains-server** directory.

        $ git clone https://github.com/allegroai/trains-server.git
        $ cd trains-server

1. Launch the Docker containers <a name="launch-docker"></a>

    * Automatically with docker-compose (details: [Linux/Ubuntu](docs/faq.md#ubuntu), [OS X](docs/faq.md#mac-osx))

        ```bash
        $ docker-compose up
        ```

    * Manually

        See [TRAINS-server: Launching Docker Containers Manually](docs/manual_docker.md) for instructions.
1. Your server is now running on [http://localhost:8080](http://localhost:8080) and the following ports are available:
    * Web server on port `8080`
    * API server on port `8008`
    * File server on port `8081`
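
Once the containers are up, one quick way to confirm that all three services answer is to probe each port. The following is a minimal sketch and not part of the trains-server distribution: it assumes `curl` is installed, and the `TRAINS_HOST` variable and the function names are made up for illustration. It only checks that something responds on each port, not that the service behind it is healthy.

```bash
#!/bin/sh
# Hypothetical helper: probe the three trains-server ports listed above.
# TRAINS_HOST is an assumed variable name; it defaults to localhost.
HOST="${TRAINS_HOST:-localhost}"

# Map a service name to the port documented in this README.
service_port() {
    case "$1" in
        web)   echo 8080 ;;
        api)   echo 8008 ;;
        files) echo 8081 ;;
        *)     return 1 ;;
    esac
}

# Build the base URL for a service.
service_url() {
    port=$(service_port "$1") || return 1
    echo "http://${HOST}:${port}"
}

# Report whether anything answers on each port (any HTTP response counts).
check_services() {
    for name in web api files; do
        url=$(service_url "$name")
        if curl -s -o /dev/null --max-time 5 "$url"; then
            echo "${name}: reachable at ${url}"
        else
            echo "${name}: NOT reachable at ${url}"
        fi
    done
}

check_services
```

If a port is reported as not reachable, check the output of `docker-compose up` for errors from the corresponding container.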
## Optional: Configuration
The **trains-server** default configuration can be easily overridden using external configuration files. By default, the server will look for these files in `/opt/trains/config`.
If the configuration is changed while the server is running, to apply the changes you must restart the server (see [Restarting trains-server](#restart-server)).
### Configuring TRAINS to Authenticate Web Login Credentials
By default, anyone can log in to the **trains-server** Web-App.
You can configure the **trains-server** to allow access only to specific users (with pre-configured usernames and passwords).

Enable this feature by placing an `apiserver.conf` file under `/opt/trains/config`.

Sample fixed user configuration file `/opt/trains/config/apiserver.conf`:

    auth {
        # Fixed users login credentials
        # No other user will be able to log in
        fixed_users {
            enabled: true
            users: [
                {
                    username: "jane"
                    password: "12345678"
                    name: "Jane Doe"
                },
                {
                    username: "john"
                    password: "12345678"
                    name: "John Doe"
                },
            ]
        }
    }

To apply the `apiserver.conf` changes, you must restart the *trains-apiserver* (docker) (see [Restarting trains-server](#restart-server)).
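
If you prefer to generate the file rather than edit it by hand, a small shell function can write it for you. This is only a sketch under stated assumptions: the function name `write_fixed_users_conf` and its argument order are hypothetical, it writes a single user in the format of the sample above, and it does not escape values that contain double quotes or `$`.

```bash
#!/bin/sh
# Hypothetical helper: write a single-user apiserver.conf.
# Arguments: <config-dir> <username> <password> <display-name>
write_fixed_users_conf() {
    conf_dir="$1"
    mkdir -p "$conf_dir" || return 1
    cat > "${conf_dir}/apiserver.conf" <<EOF
auth {
    fixed_users {
        enabled: true
        users: [
            {
                username: "$2"
                password: "$3"
                name: "$4"
            },
        ]
    }
}
EOF
}
```

For example, `write_fixed_users_conf /opt/trains/config jane 'a-long-random-password' 'Jane Doe'` (run with write permission for that directory) creates the file, after which the restart described above is still required.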
### Configuring the Non-Responsive Experiments Watchdog Thresholds
The non-responsive experiment watchdog monitors experiments that were not updated for a given period of time,
and marks them as `aborted`. The watchdog is always active, with a default threshold of 7200 seconds (2 hours).

To change the watchdog's timeouts, place a `services.conf` file under `/opt/trains/config`, containing for example:

    tasks {
        non_responsive_tasks_watchdog {
            # In-progress tasks that haven't been updated for at least 'value' seconds will be stopped by the watchdog
            threshold_sec: 7200
            # Watchdog will sleep for this number of seconds after each cycle
            watch_interval_sec: 900
        }
    }

To apply the `services.conf` changes, you must restart the *trains-apiserver* (docker) (see [Restarting trains-server](#restart-server)).
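
Because both settings are expressed in seconds, it can be convenient to derive them from hours and minutes. The sketch below does that conversion and writes the two keys shown above; the function name `write_watchdog_conf` and its arguments are assumptions made for illustration, not part of trains-server.

```bash
#!/bin/sh
# Hypothetical helper: write services.conf from human-friendly units.
# Arguments: <config-dir> <threshold-hours> <interval-minutes>
write_watchdog_conf() {
    conf_dir="$1"
    threshold_sec=$(( $2 * 3600 ))   # hours   -> seconds
    interval_sec=$(( $3 * 60 ))      # minutes -> seconds
    mkdir -p "$conf_dir" || return 1
    cat > "${conf_dir}/services.conf" <<EOF
tasks {
    non_responsive_tasks_watchdog {
        threshold_sec: ${threshold_sec}
        watch_interval_sec: ${interval_sec}
    }
}
EOF
}
```

Calling `write_watchdog_conf /opt/trains/config 2 15` reproduces the example values above (7200 and 900 seconds), while `write_watchdog_conf /opt/trains/config 4 10` writes a 14400-second threshold and a 600-second interval.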
### Restarting trains-server <a name="restart-server"></a>
To restart the **trains-server**, you must first stop and remove the containers, and then restart them.

1. Restarting docker-compose containers:

        $ docker-compose down
        $ docker-compose up

1. Restarting manually launched Docker containers: see the [instructions](docs/manual_docker.md).

## Configuring **TRAINS** client
Once you have installed the **trains-server**, make sure to configure **TRAINS** [client](https://github.com/allegroai/trains)
to use your locally installed server (and not the demo server).
- Run the `trains-init` command for an interactive setup
- Or manually edit the `~/trains.conf` file, making sure the `api_server` value is configured correctly, for example:

        api {
            api_server: "http://localhost:8008"
        }

See [Installing and Configuring TRAINS](https://github.com/allegroai/trains#installing-and-configuring-trains) for more details.
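
To double-check which server the client is pointed at, you can read the `api_server` value back from the configuration file. The following is a rough sketch: the function name `get_api_server` is hypothetical, and the `sed` expression only handles the simple quoted form shown above, not the full configuration syntax.

```bash
#!/bin/sh
# Hypothetical helper: print the api_server value from a trains.conf-style file.
# Only the simple form  api_server: "URL"  is recognized.
get_api_server() {
    sed -n 's/^[[:space:]]*api_server[[:space:]]*[:=][[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

conf_file="${HOME}/trains.conf"
if [ -f "$conf_file" ]; then
    echo "api_server = $(get_api_server "$conf_file")"
else
    echo "No ${conf_file} found; run trains-init first"
fi
```

If the printed value is not the address of your own API server (port `8008`), the client is not yet using your locally installed server.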
## What next?
Now that the **trains-server** is installed and TRAINS is configured to use it,
you can [use](https://github.com/allegroai/trains#using-trains) TRAINS in your experiments and view them in the Web-App,
for example at http://localhost:8080.
## Upgrading <a name="upgrade"></a>
We are constantly updating, improving and adding to the **trains-server**.
Each new release includes a new pre-built Docker image.
To upgrade to a new release:

1. Shut down and remove each of your Docker instances using the following commands:

    * Using Docker-Compose

        ```bash
        $ docker-compose down
        ```

    * Manual Docker launching

            sudo docker stop <docker-name>
            sudo docker rm -v <docker-name>

        The Docker names are (see [Launching Docker Containers](#launch-docker)):

        * `trains-elastic`
        * `trains-mongo`
        * `trains-fileserver`
        * `trains-apiserver`
        * `trains-webserver`

2. We highly recommend backing up your data directory! A simple way to do that is using `tar`.

    For example, if your data is stored under `/opt/trains/data`, use the following command:

        sudo tar czvf ~/trains_backup.tgz /opt/trains/data

    This backs up all data to an archive in your home directory.

    To restore this example backup, use the following commands:

        sudo rm -R /opt/trains/data
        sudo tar -xzf ~/trains_backup.tgz -C /

    The archive stores paths relative to `/` (`tar` strips the leading slash when creating it), so it must be extracted from the root directory.

3. Pull the new **trains-server** docker image using the following command:

        sudo docker pull allegroai/trains:latest

    If you wish to pull a different version, replace `latest` with the required version number, for example:

        sudo docker pull allegroai/trains:0.10.1

4. Launch the newly released Docker image (see [Launching Docker Containers](#launch-docker)).
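
The backup in step 2 can also be wrapped in a small function that names each archive by date and verifies it before you continue with the upgrade. This is a sketch rather than an official tool: the function name `backup_data` and the archive naming scheme are assumptions, and unlike the command in step 2 it stores paths relative to the parent of the data directory.

```bash
#!/bin/sh
# Hypothetical helper: archive a data directory under a timestamped name.
# Arguments: <data-dir> <destination-dir>
backup_data() {
    src="$1"
    dest_dir="$2"
    archive="${dest_dir}/trains_backup_$(date +%Y%m%d_%H%M%S).tgz"
    # Store paths relative to the parent of <data-dir> (for example data/...).
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Listing the archive catches truncated or unreadable files early.
    tar -tzf "$archive" > /dev/null || return 1
    echo "$archive"
}
```

Called as `backup_data /opt/trains/data "$HOME"` (with permission to read the data directory), it prints the path of the new archive. Because the paths inside start at `data/`, such an archive is restored with `tar -xzf <archive> -C /opt/trains` instead of `-C /`.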
## Community & Support
If you have any questions, see the TRAINS-server [FAQ](https://github.com/allegroai/trains-server/blob/master/docs/faq.md), or
post your questions on [Stack Overflow](https://stackoverflow.com/questions/tagged/trains) with the '**trains**' tag.

For feature requests or bug reports, please use [GitHub issues](https://github.com/allegroai/trains-server/issues).

Additionally, you can always find us at *trains@allegro.ai*.

## License
[Server Side Public License v1.0](https://github.com/mongodb/mongo/blob/master/LICENSE-Community.txt)

**trains-server** relies on both [MongoDB](https://github.com/mongodb/mongo) and [ElasticSearch](https://github.com/elastic/elasticsearch).
With the recent changes in both MongoDB's and ElasticSearch's OSS licenses, we feel it is our responsibility as a
member of the community to support the projects we love and cherish.
We believe the cause for the license change in both cases is more than just,
and chose [SSPL](https://www.mongodb.com/licensing/server-side-public-license) because it is the more general and flexible of the two licenses.

This is our way to say - we support you guys!