mirror of
https://github.com/clearml/clearml-server
synced 2025-02-24 13:23:00 +00:00
202 lines
8.5 KiB
Markdown
202 lines
8.5 KiB
Markdown
# TRAINS Server
|
||
## Magic Version Control & Experiment Manager for AI
|
||
|
||
## Introduction
|
||
|
||
The **trains-server** is the infrastructure behind [trains](https://github.com/allegroai/trains).
|
||
|
||
The server provides:
|
||
|
||
* UI (single-page webapp) for experiment management and browsing
|
||
* REST interface for documenting and logging experiment information, statistics and results
|
||
* REST interface for querying experiments history, logs and results
|
||
* Locally-hosted fileserver, for storing images and models to be easily accessible from the UI
|
||
|
||
The server is designed to allow multiple users to collaborate and manage their experiments.
|
||
The server’s code is freely available [here](https://github.com/allegroai/trains-server).
|
||
We've also pre-built a docker image to allow **trains** users to quickly set up their own server.
|
||
|
||
## System diagram
|
||
|
||
<pre>
|
||
TRAINS-server
|
||
+--------------------------------------------------------------------+
|
||
| |
|
||
| Server Docker Elastic Docker Mongo Docker |
|
||
| +-------------------------+ +---------------+ +------------+ |
|
||
| | Pythonic Server | | | | | |
|
||
| | +-----------------+ | | ElasticSearch | | MongoDB | |
|
||
| | | WEB server | | | | | | |
|
||
| | | Port 8080 | | | | | | |
|
||
| | +--------+--------+ | | | | | |
|
||
| | | | | | | | |
|
||
| | +--------+--------+ | | | | | |
|
||
| | | API server +----------------------------+ | |
|
||
| | | Port 8008 +---------+ | | | |
|
||
| | +-----------------+ | +-------+-------+ +-----+------+ |
|
||
| | | | | |
|
||
| | +-----------------+ | +---+----------------+------+ |
|
||
| | | File Server +-------+ | Host Storage | |
|
||
| | | Port 8081 | | +-----+ | |
|
||
| | +-----------------+ | +---------------------------+ |
|
||
| +------------+------------+ |
|
||
+---------------|----------------------------------------------------+
|
||
|HTTP
|
||
+--------+
|
||
GPU Machine |
|
||
+------------------------|-------------------------------------------+
|
||
| +------------------|--------------+ |
|
||
| | Training | | +---------------------+ |
|
||
| | Code +---+------------+ | | trains configuration| |
|
||
| | | TRAINS | | | ~/trains.conf | |
|
||
| | | +------+ | |
|
||
| | +----------------+ | +---------------------+ |
|
||
| +---------------------------------+ |
|
||
+--------------------------------------------------------------------+
|
||
</pre>
|
||
|
||
## Installation
|
||
|
||
In order to install and run the pre-built **trains-server**, you must be logged in as a user with sudo privileges.
|
||
|
||
### Setup
|
||
|
||
In order to run the pre-packaged **trains-server**, you'll need to install **docker**.
|
||
|
||
#### Install docker
|
||
|
||
```bash
|
||
sudo apt-get install docker
|
||
```
|
||
|
||
#### Setup docker daemon
|
||
In order to run the ElasticSearch docker container, you'll need to change some of the default values in the Docker configuration file.
|
||
|
||
For systems with an `/etc/sysconfig/docker` file, add the options in quotes to the available arguments in `OPTIONS`:
|
||
|
||
```bash
|
||
OPTIONS="--default-ulimit nofile=1024:65536 --default-ulimit memlock=-1:-1"
|
||
```
|
||
|
||
For systems with an `/etc/docker/daemon.json` file, add the section in curly brackets to `default-ulimits`:
|
||
|
||
```json
|
||
"default-ulimits": {
|
||
"nofile": {
|
||
"name": "nofile",
|
||
"hard": 65536,
|
||
"soft": 1024
|
||
},
|
||
"memlock":
|
||
{
|
||
"name": "memlock",
|
||
"soft": -1,
|
||
"hard": -1
|
||
}
|
||
}
|
||
```
|
||
|
||
Following this configuration change, you will have to restart the docker daemon:
|
||
|
||
```bash
|
||
sudo service docker stop
|
||
sudo service docker start
|
||
```
|
||
|
||
#### vm.max_map_count
|
||
|
||
The `vm.max_map_count` kernel setting must be at least 262144.
|
||
|
||
The following example was tested with CentOS 7, Ubuntu 16.04, Mint 18.3, Ubuntu 18.04 and Mint 19:
|
||
|
||
```bash
|
||
sudo echo "vm.max_map_count=262144" > /tmp/99-trains.conf
|
||
sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf
|
||
sudo sysctl -w vm.max_map_count=262144
|
||
```
|
||
|
||
For additional information about setting this parameter on other systems, see the [elastic](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#docker-cli-run-prod-mode) documentation.
|
||
|
||
#### Choose a data folder
|
||
|
||
You will need to choose a directory on your system in which all data maintained by **trains-server** will be stored (among others, this includes database, uploaded files and logs).
|
||
|
||
The following instructions assume the directory is `/opt/trains`.
|
||
|
||
Issue the following commands:
|
||
|
||
```bash
|
||
sudo mkdir -p /opt/trains/data/elastic && sudo chown -R 1000:1000 /opt/trains
|
||
```
|
||
|
||
### Launching docker images
|
||
|
||
|
||
To launch the docker images, issue the following commands:
|
||
|
||
|
||
```bash
|
||
sudo docker run -d --restart="always" --name="trains-elastic" -e "ES_JAVA_OPTS=-Xms2g -Xmx2g" -e "bootstrap.memory_lock=true" -e "cluster.name=trains" -e "discovery.zen.minimum_master_nodes=1" -e "node.name=trains" -e "script.inline=true" -e "script.update=true" -e "thread_pool.bulk.queue_size=2000" -e "thread_pool.search.queue_size=10000" -e "xpack.security.enabled=false" -e "xpack.monitoring.enabled=false" -e "cluster.routing.allocation.node_initial_primaries_recoveries=500" -e "node.ingest=true" -e "http.compression_level=7" -e "reindex.remote.whitelist=*.*" -e "script.painless.regex.enabled=true" --network="host" -v /opt/trains/data/elastic:/usr/share/elasticsearch/data docker.elastic.co/elasticsearch/elasticsearch:5.6.16
|
||
```
|
||
|
||
```bash
|
||
sudo docker run -d --restart="always" --name="trains-mongo" -v /opt/trains/data/mongo/db:/data/db -v /opt/trains/data/mongo/configdb:/data/configdb --network="host" mongo:3.6.5
|
||
```
|
||
|
||
```bash
|
||
sudo docker run -d --restart="always" --name="trains-fileserver" --network="host" -v /opt/trains/logs:/var/log/trains -v /opt/trains/data/fileserver:/mnt/fileserver allegroai/trains:latest fileserver
|
||
```
|
||
|
||
```bash
|
||
sudo docker run -d --restart="always" --name="trains-apiserver" --network="host" -v /opt/trains/logs:/var/log/trains allegroai/trains:latest apiserver
|
||
```
|
||
|
||
```bash
|
||
sudo docker run -d --restart="always" --name="trains-webserver" --network="host" -v /opt/trains/logs:/var/log/trains allegroai/trains:latest webserver
|
||
```
|
||
|
||
Once the **trains-server** dockers are up, the following are available:
|
||
|
||
* API server on port `8008`
|
||
* Web server on port `8080`
|
||
* File server on port `8081`
|
||
|
||
## Upgrade
|
||
|
||
We are constantly updating and adding stuff.
|
||
When we release a new version, we’ll include a new pre-built docker image.
|
||
Once a new release is out, you can simply:
|
||
|
||
1. Shut down and remove your docker instances. Each instance can be shut down and removed using the following commands:
|
||
```bash
|
||
sudo docker stop <docker-name>
|
||
sudo docker rm -v <docker-name>
|
||
```
|
||
The docker names are (see [Launching docker images](#Launching-docker-images)):
|
||
* `trains-elastic`
|
||
* `trains-mongo`
|
||
* `trains-fileserver`
|
||
* `trains-apiserver`
|
||
* `trains-webserver`
|
||
|
||
2. Back up your data folder (recommended!). A simple way to do that is using this command:
|
||
```bash
|
||
sudo tar czvf ~/trains_backup.tgz /opt/trains/data
|
||
```
|
||
Which will back up all data to an archive in your home folder. Restoring such a backup can be done using these commands:
|
||
```bash
|
||
sudo rm -R /opt/trains/data
|
||
sudo tar -xzf ~/trains_backup.tgz -C /opt/trains/data
|
||
```
|
||
3. Launch the newly released docker image (see [Launching docker images](#Launching-docker-images))
|
||
|
||
## License
|
||
|
||
[Server Side Public License v1.0](https://github.com/mongodb/mongo/blob/master/LICENSE-Community.txt)
|
||
|
||
**trains-server** relies *heavily* on both [MongoDB](https://github.com/mongodb/mongo) and [ElasticSearch](https://github.com/elastic/elasticsearch).
|
||
With the recent changes in both MongoDB's and ElasticSearch's OSS license, we feel it is our job as a community to support the projects we love and cherish.
|
||
We feel the cause for the license change in both cases is more than just, and chose [SSPL](https://www.mongodb.com/licensing/server-side-public-license) because it is the more restrictive of the two.
|
||
|
||
This is our way to say - we support you guys!
|