# TRAINS Server
## Magic Version Control & Experiment Manager for AI
[![GitHub license](https://img.shields.io/badge/license-SSPL-green.svg)](https://img.shields.io/badge/license-SSPL-green.svg)
[![GitHub version](https://img.shields.io/github/release-pre/allegroai/trains-server.svg)](https://img.shields.io/github/release-pre/allegroai/trains-server.svg)
[![PyPI status](https://img.shields.io/badge/status-beta-yellow.svg)](https://img.shields.io/badge/status-beta-yellow.svg)
## Introduction
The **trains-server** is the infrastructure for [TRAINS](https://github.com/allegroai/trains).
It allows multiple users to collaborate and manage their experiments.
The **trains-server** contains the following components:

* The Web-App, a single-page UI for experiment management and browsing
* A REST interface for:
    * documenting and logging experiment information, statistics and results
    * querying experiment history, logs and results
* A locally-hosted file server for storing images and models, making them easily accessible from the Web-App

You can quickly set up your **trains-server** using a pre-built Docker image (see [Installation](#installation)).

When new releases are available, you can upgrade your pre-built Docker image (see [Upgrade](#upgrade)).

The **trains-server** code is freely available [here](https://github.com/allegroai/trains-server).
## System diagram
<pre>
                            TRAINS-server
+--------------------------------------------------------------------+
|                                                                    |
|   Server Docker                   Elastic Docker     Mongo Docker  |
|  +-------------------------+     +---------------+  +------------+ |
|  |     Pythonic Server     |     |               |  |            | |
|  |   +-----------------+   |     | ElasticSearch |  |  MongoDB   | |
|  |   |   WEB server    |   |     |               |  |            | |
|  |   |   Port 8080     |   |     |               |  |            | |
|  |   +--------+--------+   |     |               |  |            | |
|  |            |            |     |               |  |            | |
|  |   +--------+--------+   |     |               |  |            | |
|  |   |   API server    +----------------------------+            | |
|  |   |   Port 8008     +---------+               |  |            | |
|  |   +-----------------+   |     +-------+-------+  +-----+------+ |
|  |                         |             |                |        |
|  |   +-----------------+   |         +---+----------------+------+ |
|  |   |   File Server   +-------+     |    Host Storage           | |
|  |   |   Port 8081     |   |   +-----+                           | |
|  |   +-----------------+   |         +---------------------------+ |
|  +------------+------------+                                       |
+---------------|----------------------------------------------------+
                |HTTP
                +--------+
GPU Machine              |
+------------------------|-------------------------------------------+
|     +------------------|--------------+                            |
|     |  Training        |              |    +---------------------+ |
|     |  Code        +---+------------+ |    | TRAINS configuration| |
|     |              | TRAINS         | |    | ~/trains.conf       | |
|     |              |                +------+                     | |
|     |              +----------------+ |    +---------------------+ |
|     +---------------------------------+                            |
+--------------------------------------------------------------------+
</pre>
## Installation
This section contains instructions for setting up and launching the pre-built Docker image for the **trains-server**.

**Note**: This Docker image was tested on Linux only. For Windows users, we recommend running the server on a Linux virtual machine.

### Prerequisites

You must be logged in as a user with sudo privileges.
### Setup
#### Step 1. Install Docker CE
You must install Docker to run the pre-packaged **trains-server**.
* For [Ubuntu](https://docs.docker.com/install/linux/docker-ce/ubuntu/) / Mint (x86_64/amd64):
```bash
# Install the packages that let apt use repositories over HTTPS
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
# Add Docker's official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# Load UBUNTU_CODENAME from /etc/os-release (Mint defines it too) and add the Docker CE repository
. /etc/os-release
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $UBUNTU_CODENAME stable"
# Refresh the package index and install Docker CE
sudo apt-get update
sudo apt-get install -y docker-ce
```
* For other operating systems, see [Supported platforms](https://docs.docker.com/install//#support) in the Docker documentation for instructions.
#### Step 2. Setup the Docker daemon
To run the ElasticSearch Docker container, you must set up the Docker daemon with the default ulimit values that Elastic requires, by editing your Docker configuration file. We provide instructions for the most common Docker configuration files.

You must edit or create a Docker configuration file:

* If your Docker configuration file is `/etc/sysconfig/docker`, edit it.

Add the options in quotes to the available arguments in the `OPTIONS` section:
```bash
OPTIONS="--default-ulimit nofile=1024:65536 --default-ulimit memlock=-1:-1"
```
* Otherwise, edit `/etc/docker/daemon.json` (if it exists) or create it (if it does not exist).

Add or modify the `default-ulimits` section as shown below. Be sure your configuration file contains the `nofile` and `memlock` sub-sections and values shown.

**Note**: Your configuration file may contain other sections. If so, confirm that the sections are separated by commas. For more information about Docker configuration files, see [Daemon configuration file](https://docs.docker.com/engine/reference/commandline/dockerd/#daemon-configuration-file) in the Docker documentation.

The **trains-server** requires the following default values:
```json
{
    "default-ulimits": {
        "nofile": {
            "name": "nofile",
            "hard": 65536,
            "soft": 1024
        },
        "memlock": {
            "name": "memlock",
            "soft": -1,
            "hard": -1
        }
    }
}
```
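If `/etc/docker/daemon.json` does not exist yet, the file can also be created from a script. The following is a minimal sketch, assuming a POSIX shell; `write_daemon_json` is a hypothetical helper, not part of **trains-server** or Docker. It writes exactly the values above and deliberately refuses to touch an existing file, because merging sections into an existing configuration is safer to do by hand.

```bash
#!/bin/sh
# write_daemon_json [PATH]
# Hypothetical helper (not part of trains-server or Docker): creates PATH,
# by default /etc/docker/daemon.json, containing only the default-ulimits
# section required by trains-server. It refuses to overwrite an existing
# file, which may already hold other sections that must be merged by hand.
write_daemon_json() {
    target=${1:-/etc/docker/daemon.json}
    if [ -e "$target" ]; then
        echo "$target already exists; add the default-ulimits section manually" >&2
        return 1
    fi
    # The quoted 'EOF' delimiter stops the shell from expanding anything inside.
    cat > "$target" <<'EOF'
{
    "default-ulimits": {
        "nofile": {
            "name": "nofile",
            "hard": 65536,
            "soft": 1024
        },
        "memlock": {
            "name": "memlock",
            "soft": -1,
            "hard": -1
        }
    }
}
EOF
}
```

For example, after saving the function to a file such as `write_daemon_json.sh` (a hypothetical name), run `sudo sh -c '. ./write_daemon_json.sh && write_daemon_json'`, then restart the daemon as described in Step 3.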
#### Step 3. Restart the Docker daemon
You must restart the Docker daemon after modifying the configuration file:
```bash
sudo service docker stop
sudo service docker start
```
#### Step 4. Set the Maximum Number of Memory Map Areas
The maximum number of memory map areas a process can use is defined by the `vm.max_map_count` kernel setting.

Elastic requires `vm.max_map_count` to be at least 262144.

* For CentOS 7, Ubuntu 16.04, Mint 18.3, Ubuntu 18.04 and Mint 19 users, we tested the following commands to set `vm.max_map_count`:
```bash
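# Persist the setting across reboots (writing to /tmp needs no sudo)
echo "vm.max_map_count=262144" > /tmp/99-trains.conf
sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf
# Apply the setting to the running kernel immediately
sudo sysctl -w vm.max_map_count=262144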
```
* For information about setting this parameter on other systems, see the [elastic](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#docker-cli-run-prod-mode) documentation.
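
To check the setting, for example after a reboot, you can compare it with Elastic's minimum. This is a sketch assuming a POSIX shell on Linux (it reads `/proc/sys/vm/max_map_count`); `check_max_map_count` and `ES_MIN_MAP_COUNT` are hypothetical names.

```bash
#!/bin/sh
# Minimum value of vm.max_map_count that Elastic requires.
ES_MIN_MAP_COUNT=262144

# check_max_map_count [VALUE]
# Hypothetical helper: compares VALUE, or the live kernel setting read from
# /proc/sys/vm/max_map_count when no VALUE is given, against the minimum.
# Prints the result and returns 0 when the value is high enough, 1 otherwise.
check_max_map_count() {
    value=${1:-$(cat /proc/sys/vm/max_map_count)}
    if [ "$value" -ge "$ES_MIN_MAP_COUNT" ]; then
        echo "vm.max_map_count=$value (OK)"
        return 0
    fi
    echo "vm.max_map_count=$value is below $ES_MIN_MAP_COUNT" >&2
    return 1
}
```

Calling `check_max_map_count` with no argument checks the running kernel; passing a number checks that value instead.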
#### Step 5. Choose a Data Directory
You must choose a directory on your system in which the **trains-server** stores all of its data, create that directory, and set its permissions. The data stored in that directory includes the database, uploaded files and logs.

For example, if your data directory is `/opt/trains`, use the following command:
```bash
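# Create the Elastic data directory and give UID/GID 1000 (the user the Elastic container runs as) ownership of /opt/trains
sudo mkdir -p /opt/trains/data/elastic && sudo chown -R 1000:1000 /opt/trains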
```
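The command above creates only `data/elastic`; Docker creates the other bind-mounted host directories itself on first launch. If you prefer to create them all up front, the following sketch does so, assuming a POSIX shell. `prepare_trains_dirs` is a hypothetical helper, and its directory list is taken from the `-v` options in [Launching Docker Containers](#launching-docker-containers).

```bash
#!/bin/sh
# prepare_trains_dirs ROOT
# Hypothetical helper: creates, under ROOT, every host directory that the
# docker run commands in "Launching Docker Containers" mount, then gives
# UID/GID 1000 ownership of ROOT, as the command above does.
prepare_trains_dirs() {
    root=${1:?usage: prepare_trains_dirs ROOT}
    for sub in data/elastic data/mongo/db data/mongo/configdb data/fileserver logs; do
        mkdir -p "$root/$sub" || return 1
    done
    # Changing ownership to another user requires root; warn and skip otherwise.
    if [ "$(id -u)" -eq 0 ]; then
        chown -R 1000:1000 "$root"
    else
        echo "not running as root: skipping chown -R 1000:1000 $root" >&2
    fi
}
```

For example: `sudo sh -c '. ./prepare_trains_dirs.sh && prepare_trains_dirs /opt/trains'`, where `prepare_trains_dirs.sh` is a hypothetical file holding the function.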
### Launching Docker Containers
Launch the Docker containers. For example, if your data directory is `/opt/trains`, use the following commands:
```bash
sudo docker run -d --restart="always" --name="trains-elastic" -e "ES_JAVA_OPTS=-Xms2g -Xmx2g" -e "bootstrap.memory_lock=true" -e "cluster.name=trains" -e "discovery.zen.minimum_master_nodes=1" -e "node.name=trains" -e "script.inline=true" -e "script.update=true" -e "thread_pool.bulk.queue_size=2000" -e "thread_pool.search.queue_size=10000" -e "xpack.security.enabled=false" -e "xpack.monitoring.enabled=false" -e "cluster.routing.allocation.node_initial_primaries_recoveries=500" -e "node.ingest=true" -e "http.compression_level=7" -e "reindex.remote.whitelist=*.*" -e "script.painless.regex.enabled=true" --network="host" -v /opt/trains/data/elastic:/usr/share/elasticsearch/data docker.elastic.co/elasticsearch/elasticsearch:5.6.16
```
```bash
sudo docker run -d --restart="always" --name="trains-mongo" -v /opt/trains/data/mongo/db:/data/db -v /opt/trains/data/mongo/configdb:/data/configdb --network="host" mongo:3.6.5
```
```bash
sudo docker run -d --restart="always" --name="trains-fileserver" --network="host" -v /opt/trains/logs:/var/log/trains -v /opt/trains/data/fileserver:/mnt/fileserver allegroai/trains:latest fileserver
```
```bash
sudo docker run -d --restart="always" --name="trains-apiserver" --network="host" -v /opt/trains/logs:/var/log/trains allegroai/trains:latest apiserver
```
```bash
sudo docker run -d --restart="always" --name="trains-webserver" --network="host" -v /opt/trains/logs:/var/log/trains allegroai/trains:latest webserver
```
After the **trains-server** containers are up, the following services are available:
* API server on port `8008`
* Web server on port `8080`
* File server on port `8081`
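
To confirm that all five containers launched above are running, you can compare the output of `docker ps` with the container names. A minimal sketch, assuming a POSIX shell; `missing_trains_containers` is a hypothetical helper that reads names on stdin, so it can also be tried without Docker.

```bash
#!/bin/sh
# missing_trains_containers
# Hypothetical helper: reads container names, one per line, on stdin (for
# example the output of `docker ps --format '{{.Names}}'`) and prints each
# trains-server container name that is absent from that list.
# Returns 0 when all five containers are present and 1 otherwise.
missing_trains_containers() {
    running=$(cat)
    rc=0
    for name in trains-elastic trains-mongo trains-fileserver trains-apiserver trains-webserver; do
        # -x matches whole lines and -F disables regex, so "trains-mongo"
        # does not match a container named "trains-mongo-old".
        if ! printf '%s\n' "$running" | grep -qxF "$name"; then
            echo "$name"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `sudo docker ps --format '{{.Names}}' | missing_trains_containers` prints nothing and returns 0 when every container is up.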
## Upgrade
We are constantly updating, improving and adding to the **trains-server**.
New releases include new pre-built Docker images.
When we release a new version with a new pre-built Docker image, upgrade as follows:

1. Shut down and remove each of your Docker containers using the following commands:

   ```bash
   sudo docker stop <docker-name>
   sudo docker rm -v <docker-name>
   ```

   The Docker names are (see [Launching Docker Containers](#launching-docker-containers)):

   * `trains-elastic`
   * `trains-mongo`
   * `trains-fileserver`
   * `trains-apiserver`
   * `trains-webserver`

2. We highly recommend backing up your data directory! A simple way to do that is using `tar`.
   For example, if your data directory is `/opt/trains`, use the following command:

   ```bash
   sudo tar czvf ~/trains_backup.tgz /opt/trains/data
   ```

   This backs up all data to an archive in your home directory.
   Because `tar` strips the leading `/` from the stored paths, restore this example backup by extracting it relative to `/`:

   ```bash
   sudo rm -R /opt/trains/data
   sudo tar -xzf ~/trains_backup.tgz -C /
   ```

3. Launch the newly released Docker image (see [Launching Docker Containers](#launching-docker-containers)).
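
The container removal and backup steps of the upgrade can be scripted. The following is a sketch assuming a POSIX shell; `remove_trains_containers`, `backup_trains_data`, `restore_trains_data` and the `DOCKER` variable are hypothetical. Unlike the `tar` commands in the upgrade steps, this backup uses `tar -C ROOT`, so the archive stores paths relative to ROOT (`data/...`) and is restored into the same ROOT.

```bash
#!/bin/sh
# remove_trains_containers
# Hypothetical helper: stops and removes the five trains-server containers.
# DOCKER defaults to "sudo docker"; set DOCKER=echo for a dry run that only
# prints the commands. $docker_cmd is deliberately unquoted so that
# "sudo docker" splits into a command and its argument.
remove_trains_containers() {
    docker_cmd=${DOCKER:-sudo docker}
    for name in trains-elastic trains-mongo trains-fileserver trains-apiserver trains-webserver; do
        $docker_cmd stop "$name" || return 1
        $docker_cmd rm -v "$name" || return 1
    done
}

# backup_trains_data ROOT ARCHIVE
# Hypothetical helper: archives ROOT/data into ARCHIVE, storing relative paths.
backup_trains_data() {
    tar czf "$2" -C "$1" data
}

# restore_trains_data ROOT ARCHIVE
# Hypothetical helper: replaces ROOT/data with the contents of ARCHIVE.
# ${1:?} aborts if ROOT is empty, so this can never run "rm -rf /data".
restore_trains_data() {
    rm -rf "${1:?ROOT required}/data" && tar xzf "$2" -C "$1"
}
```

For example, in a root shell: `backup_trains_data /opt/trains ~/trains_backup.tgz`, then `DOCKER=docker` followed by `remove_trains_containers`, and finally launch the new image.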
## License
[Server Side Public License v1.0](https://github.com/mongodb/mongo/blob/master/LICENSE-Community.txt)
**trains-server** relies *heavily* on both [MongoDB](https://github.com/mongodb/mongo) and [ElasticSearch](https://github.com/elastic/elasticsearch).
With the recent changes in both MongoDB's and ElasticSearch's OSS licenses, we feel it is our job as a community to support the projects we love and cherish.

We feel the cause for the license change in both cases is more than just, and chose [SSPL](https://www.mongodb.com/licensing/server-side-public-license) because it is the more general and flexible of the two.

This is our way to say: we support you guys!