mirror of https://github.com/clearml/clearml-server synced 2025-01-31 02:46:53 +00:00




TRAINS Server

Magic Version Control & Experiment Manager for AI

Introduction

The trains-server is the infrastructure behind trains.

The server provides:

- UI (single-page webapp) for experiment management and browsing
- REST interface for documenting and logging experiment information, statistics and results
- REST interface for querying experiments history, logs and results
- Locally-hosted fileserver, for storing images and models to be easily accessible from the UI

The server is designed to allow multiple users to collaborate and manage their experiments. The server's code is freely available in this repository. We've also pre-built a docker image to allow trains users to quickly set up their own server.

System diagram

 TRAINS-server
 +--------------------------------------------------------------------+
 |                                                                    |
 |   Server Docker                   Elastic Docker     Mongo Docker  |
 |  +-------------------------+     +---------------+  +------------+ |
 |  |     Pythonic Server     |     |               |  |            | |
 |  |   +-----------------+   |     | ElasticSearch |  |  MongoDB   | |
 |  |   |   WEB server    |   |     |               |  |            | |
 |  |   |   Port 8080     |   |     |               |  |            | |
 |  |   +--------+--------+   |     |               |  |            | |
 |  |            |            |     |               |  |            | |
 |  |   +--------+--------+   |     |               |  |            | |
 |  |   |   API server    +----------------------------+            | |
 |  |   |   Port 8008     +---------+               |  |            | |
 |  |   +-----------------+   |     +-------+-------+  +-----+------+ |
 |  |                         |             |                |        |
 |  |   +-----------------+   |         +---+----------------+------+ |
 |  |   |   File Server   +-------+     |    Host Storage           | |
 |  |   |   Port 8081     |   |   +-----+                           | |
 |  |   +-----------------+   |         +---------------------------+ |
 |  +------------+------------+                                       |
 +---------------|----------------------------------------------------+
                 |HTTP
                 +--------+
 GPU Machine              |
 +------------------------|-------------------------------------------+
 |     +------------------|--------------+                            |
 |     |  Training        |              |    +---------------------+ |
 |     |  Code        +---+------------+ |    | trains configuration| |
 |     |              | TRAINS         | |    | ~/trains.conf       | |
 |     |              |                +------+                     | |
 |     |              +----------------+ |    +---------------------+ |
 |     +---------------------------------+                            |
 +--------------------------------------------------------------------+

Installation

In order to install and run the pre-built trains-server, you must be logged in as a user with sudo privileges.

Setup

In order to run the pre-packaged trains-server, you'll need to install docker.

Install docker

sudo apt-get install docker.io

Setup docker daemon

In order to run the ElasticSearch docker container, you'll need to change some of the default values in the Docker configuration file.

For systems with an /etc/sysconfig/docker file, add the options in quotes to the available arguments in OPTIONS:

OPTIONS="--default-ulimit nofile=1024:65536 --default-ulimit memlock=-1:-1"

For systems with an /etc/docker/daemon.json file, add the section in curly brackets to default-ulimits:

"default-ulimits": {
    "nofile": {
        "name": "nofile",
        "hard": 65536,
        "soft": 1024
    },
    "memlock": {
        "name": "memlock",
        "soft": -1,
        "hard": -1
    }
}
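
On a host that has no /etc/docker/daemon.json yet, the fragment above needs to be wrapped in a top-level JSON object. The following is a minimal sketch, assuming no other daemon settings have to be preserved. It writes a candidate file to a scratch location (the `candidate` variable is only illustrative) and, if python3 happens to be installed, checks that the file parses as JSON before you copy it into place.

```shell
# Sketch: build a complete daemon.json candidate in a scratch location.
# Assumption: the host has no existing Docker daemon settings to keep.
candidate="$(mktemp)"

cat > "$candidate" <<'EOF'
{
    "default-ulimits": {
        "nofile": {
            "name": "nofile",
            "hard": 65536,
            "soft": 1024
        },
        "memlock": {
            "name": "memlock",
            "soft": -1,
            "hard": -1
        }
    }
}
EOF

# Optional syntax check; skipped silently when python3 is not available.
if command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "$candidate" >/dev/null && echo "valid JSON: $candidate"
fi

# To install the file: sudo cp "$candidate" /etc/docker/daemon.json
```

If the host already has a daemon.json, merge the default-ulimits section into it by hand instead of overwriting the file.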

Following this configuration change, you will have to restart the docker daemon:

sudo service docker stop
sudo service docker start

vm.max_map_count

The vm.max_map_count kernel setting must be at least 262144.

The following example was tested with CentOS 7, Ubuntu 16.04, Mint 18.3, Ubuntu 18.04 and Mint 19:

echo "vm.max_map_count=262144" | sudo tee /etc/sysctl.d/99-trains.conf
sudo sysctl -w vm.max_map_count=262144

For additional information about setting this parameter on other systems, see the elastic documentation.
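
To confirm that the setting took effect, the live kernel value can be compared against the minimum. This is a small sketch, assuming a Linux host that exposes the value under /proc/sys; the names `REQUIRED_MAP_COUNT` and `map_count_ok` are illustrative and not part of trains-server.

```shell
# Minimum value stated in this README for the ElasticSearch container.
REQUIRED_MAP_COUNT=262144

# Succeeds when the given value meets the minimum.
map_count_ok() {
    [ "$1" -ge "$REQUIRED_MAP_COUNT" ]
}

# Read the live kernel value; fall back to 0 where /proc is unavailable.
current="$(cat /proc/sys/vm/max_map_count 2>/dev/null || echo 0)"

if map_count_ok "$current"; then
    echo "vm.max_map_count=$current (ok)"
else
    echo "vm.max_map_count=$current is below $REQUIRED_MAP_COUNT"
fi
```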

Choose a data folder

Choose a directory on your system in which trains-server will store all of its data, including the databases, uploaded files and logs.

The following instructions assume the directory is /opt/trains.

Issue the following commands:

sudo mkdir -p /opt/trains/data/elastic && sudo chown -R 1000:1000 /opt/trains
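
Docker creates any missing host directory named in a -v option, so only the elastic directory has to exist in advance. If you prefer to see the full layout up front, the sketch below creates every host path that the docker run commands in this README mount. The names `TRAINS_SUBDIRS` and `make_trains_dirs` are illustrative, and the example runs against a scratch directory rather than /opt/trains.

```shell
# Host paths taken from the -v options of the docker run commands in this README,
# relative to the data folder root.
TRAINS_SUBDIRS="data/elastic data/mongo/db data/mongo/configdb data/fileserver logs"

# Create the layout under the root directory given as the first argument.
make_trains_dirs() {
    root="$1"
    for sub in $TRAINS_SUBDIRS; do
        mkdir -p "$root/$sub"
    done
}

# Dry run in a scratch directory, then list what was created.
scratch="$(mktemp -d)"
make_trains_dirs "$scratch"
find "$scratch" -type d | sort
```

For the real data folder, the same function would be called with /opt/trains from a shell running with root privileges.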

Launching docker images

To launch the docker images, issue the following commands:

sudo docker run -d --restart="always" --name="trains-elastic" \
    -e "ES_JAVA_OPTS=-Xms2g -Xmx2g" \
    -e "bootstrap.memory_lock=true" \
    -e "cluster.name=trains" \
    -e "discovery.zen.minimum_master_nodes=1" \
    -e "node.name=trains" \
    -e "script.inline=true" \
    -e "script.update=true" \
    -e "thread_pool.bulk.queue_size=2000" \
    -e "thread_pool.search.queue_size=10000" \
    -e "xpack.security.enabled=false" \
    -e "xpack.monitoring.enabled=false" \
    -e "cluster.routing.allocation.node_initial_primaries_recoveries=500" \
    -e "node.ingest=true" \
    -e "http.compression_level=7" \
    -e "reindex.remote.whitelist=*.*" \
    -e "script.painless.regex.enabled=true" \
    --network="host" \
    -v /opt/trains/data/elastic:/usr/share/elasticsearch/data \
    docker.elastic.co/elasticsearch/elasticsearch:5.6.16

sudo docker run -d --restart="always" --name="trains-mongo" -v /opt/trains/data/mongo/db:/data/db -v /opt/trains/data/mongo/configdb:/data/configdb --network="host" mongo:3.6.5

sudo docker run -d --restart="always" --name="trains-fileserver" --network="host" -v /opt/trains/logs:/var/log/trains -v /opt/trains/data/fileserver:/mnt/fileserver allegroai/trains:latest fileserver

sudo docker run -d --restart="always" --name="trains-apiserver" --network="host" -v /opt/trains/logs:/var/log/trains allegroai/trains:latest apiserver

sudo docker run -d --restart="always" --name="trains-webserver" --network="host" -v /opt/trains/logs:/var/log/trains allegroai/trains:latest webserver

Once the trains-server containers are up, the following services are available:

- API server on port 8008
- Web server on port 8080
- File server on port 8081
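
A quick way to see whether the three services respond is to send a plain HTTP request to each port. The sketch below assumes curl is installed and that the server runs on the local machine; `service_for_port` and `check_services` are illustrative helper names. Any HTTP response, including an error status, counts as the service being reachable.

```shell
# Map each port from the list above to its service name.
service_for_port() {
    case "$1" in
        8008) echo "API server" ;;
        8080) echo "Web server" ;;
        8081) echo "File server" ;;
        *) return 1 ;;
    esac
}

# Request the root URL on each port and report whether anything answered.
check_services() {
    for port in 8008 8080 8081; do
        if curl --silent --output /dev/null --max-time 5 "http://localhost:${port}/"; then
            echo "$(service_for_port "$port") answers on port ${port}"
        else
            echo "$(service_for_port "$port") not reachable on port ${port}"
        fi
    done
}

check_services
```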

Upgrade

We update trains-server regularly, and each release includes a new pre-built docker image. Once a new release is out, upgrade as follows:

1. Shut down and remove your docker instances. Each instance can be shut down and removed using the following commands:
```
sudo docker stop <docker-name>
sudo docker rm -v <docker-name>
```
The docker names are (see Launching docker images):
- trains-elastic
- trains-mongo
- trains-fileserver
- trains-apiserver
- trains-webserver
2. Back up your data folder (recommended!). A simple way to do that is using this command:
```
sudo tar czvf ~/trains_backup.tgz /opt/trains/data
```
This backs up all data to an archive in your home folder. tar strips the leading / and stores the paths as opt/trains/data/..., so restore such a backup by extracting it at the filesystem root:
```
sudo rm -R /opt/trains/data
sudo tar -xzf ~/trains_backup.tgz -C /
```
3. Launch the newly released docker image (see Launching docker images). Because docker run reuses a locally cached allegroai/trains:latest image if one exists, pull the new image first with sudo docker pull allegroai/trains:latest.
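
The stop-and-remove step can be wrapped in a loop over the five container names. The following is a minimal sketch; `TRAINS_CONTAINERS`, `DOCKER` and `remove_trains_containers` are illustrative names, and the `DOCKER` variable exists only so the commands can be previewed without touching any container.

```shell
# Container names from the "Launching docker images" section.
TRAINS_CONTAINERS="trains-elastic trains-mongo trains-fileserver trains-apiserver trains-webserver"

# Command prefix; set DOCKER=echo to print the commands instead of running them.
DOCKER="${DOCKER:-sudo docker}"

# Stop each container, then remove it together with its anonymous volumes.
remove_trains_containers() {
    for name in $TRAINS_CONTAINERS; do
        # $DOCKER is intentionally unquoted so "sudo docker" splits into two words.
        $DOCKER stop "$name"
        $DOCKER rm -v "$name"
    done
}
```

The function is only defined here, not called. Setting `DOCKER=echo` and then calling `remove_trains_containers` prints the ten commands as a dry run; calling it with the default prefix performs the actual shutdown.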

License

Server Side Public License v1.0

trains-server relies heavily on both MongoDB and ElasticSearch. With the recent changes in both MongoDB's and ElasticSearch's OSS license, we feel it is our job as a community to support the projects we love and cherish. We feel the cause for the license change in both cases is more than just, and chose SSPL because it is the more restrictive of the two.

This is our way to say - we support you guys!
