From 1e701becd3e48f9b3171714d4140c4f31a2b4323 Mon Sep 17 00:00:00 2001 From: allegroai <> Date: Tue, 29 Oct 2019 20:43:46 +0200 Subject: [PATCH] Upgrade to v0.12 --- README.md | 163 ++++++++++++++++++++++++------------------- docs/docker_setup.md | 19 +++-- docs/install_aws.md | 71 ++++++++++++++----- 3 files changed, 157 insertions(+), 96 deletions(-) diff --git a/README.md b/README.md index 6ba28f5..1011963 100644 --- a/README.md +++ b/README.md @@ -11,7 +11,7 @@ The **trains-server** is the backend service infrastructure for [TRAINS](https://github.com/allegroai/trains). It allows multiple users to collaborate and manage their experiments. -By default, TRAINS is set up to work with the TRAINS demo server, which is open to anyone and resets periodically. +By default, TRAINS is set up to work with the TRAINS demo server, which is open to anyone and resets periodically. In order to host your own server, you will need to install **trains-server** and point TRAINS to it. **trains-server** contains the following components: @@ -23,9 +23,9 @@ In order to host your own server, you will need to install **trains-server** and * Locally-hosted file server for storing images and models making them easily accessible using the Web-App You can quickly setup your **trains-server** using: - - [Docker Installation](#installation) + - [Docker Installation](#installation) - Pre-built Amazon [AWS image](#aws) - - [Kubernetes Helm](https://github.com/allegroai/trains-server-helm#trains-server-for-kubernetes-clusters-using-helm) + - [Kubernetes Helm](https://github.com/allegroai/trains-server-helm#trains-server-for-kubernetes-clusters-using-helm) or manual [Kubernetes installation](https://github.com/allegroai/trains-server-k8s#trains-server-for-kubernetes-clusters) @@ -36,80 +36,81 @@ You can quickly setup your **trains-server** using: **trains-server** has two supported configurations: - Single IP (domain) with the following open ports - - Web application on port 8080 + - Web 
application on port 8080 - API service on port 8008 - File storage service on port 8081 - + - Sub-Domain configuration with default http/s ports (80 or 443) - Web application on sub-domain: app.\*.\* - API service on sub-domain: api.\*.\* - File storage service on sub-domain: files.\*.\* - + ## Install / Upgrade - AWS -Use one of our pre-installed Amazon Machine Images for easy deployment in AWS. +Use one of our pre-installed Amazon Machine Images for easy deployment in AWS. For details and instructions, see [TRAINS-server: AWS pre-installed images](docs/install_aws.md). ## Docker Installation - Linux, Mac OS X -Use our pre-built Docker image for easy deployment in Linux and Mac OS X. +Use our pre-built Docker image for easy deployment in Linux and Mac OS X. For Windows, we recommend installing our pre-built Docker image on a Linux virtual machine. Latest docker images can be found [here](https://hub.docker.com/r/allegroai/trains). 1. Setup Docker ([docker-compose Ubuntu](docs/faq.md#ubuntu), [docker-compose OS X](docs/faq.md#mac-osx), [Setup Docker Service Manually](docs/docker_setup.md#setup-docker)) - Make sure port 8080/8081/8008 are available for the `trains-server` services - + Make sure port 8080/8081/8008 are available for the `trains-server` services + Increase vm.max_map_count for `ElasticSearch` docker - + ```bash echo "vm.max_map_count=262144" > /tmp/99-trains.conf sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf sudo sysctl -w vm.max_map_count=262144 - + sudo service docker restart - ``` + ``` 1. Create local directories for the databases and storage. 
- + ```bash sudo mkdir -p /opt/trains/data/elastic sudo mkdir -p /opt/trains/data/mongo/db sudo mkdir -p /opt/trains/data/mongo/configdb + sudo mkdir -p /opt/trains/data/redis sudo mkdir -p /opt/trains/logs sudo mkdir -p /opt/trains/data/fileserver sudo mkdir -p /opt/trains/config - ``` + ``` Linux ```bash $ sudo chown -R 1000:1000 /opt/trains ``` Mac OS X - ```bash + ```bash $ sudo chown -R $(whoami):staff /opt/trains ``` - + 1. Clone the [trains-server](https://github.com/allegroai/trains-server) repository and change directories to the new **trains-server** directory. - - ```bash + + ```bash $ git clone https://github.com/allegroai/trains-server.git $ cd trains-server ``` - + 1. Launch the Docker containers * Automatically with docker-compose (details: [Linux/Ubuntu](docs/faq.md#ubuntu), [OS X](docs/faq.md#mac-osx)) - - ```bash + + ```bash $ docker-compose up ``` - + * Manually, see [Launching Docker Containers Manually](docs/docker_setup.md#launch) for instructions. - + 1. Your server is now running on [http://localhost:8080](http://localhost:8080) and the following ports are available: - + * Web server on port `8080` * API server on port `8008` * File server on port `8081` @@ -126,12 +127,12 @@ By default anyone can login to the **trains-server** Web-App. You can configure the **trains-server** to allow only a specific set of users to access the system. Enable this feature by placing `apiserver.conf` file under `/opt/trains/config`. - + Sample fixed user configuration file `/opt/trains/config/apiserver.conf`: auth { - # Fixed users login credetials + # Fixed users login credetials # No other user will be able to login fixed_users { enabled: true @@ -149,12 +150,12 @@ Sample fixed user configuration file `/opt/trains/config/apiserver.conf`: ] } } - + To apply the `apiserver.conf` changes, you must restart the *trains-apiserver* (docker) (see [Restarting trains-server](#restart-server)). 
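As a minimal sketch of the step above, assuming the default `/opt/trains/config` directory, the following checks that the configuration file is in place before the restart. The `check_fixed_users_conf` helper is hypothetical and not part of **trains-server**; it only confirms that the file exists and mentions `fixed_users`, and it does not validate the configuration syntax.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of trains-server): succeeds only if the given
# apiserver.conf exists and contains a fixed_users section.
# Argument 1 (optional): path to the file, defaults to the location used above.
check_fixed_users_conf() {
    local conf_file="${1:-/opt/trains/config/apiserver.conf}"

    # The api server can only pick up the file if it exists.
    if [ ! -f "$conf_file" ]; then
        echo "missing: $conf_file" >&2
        return 1
    fi

    # A plain text search; this does not parse or validate the file.
    if ! grep -q "fixed_users" "$conf_file"; then
        echo "no fixed_users section in $conf_file" >&2
        return 2
    fi

    echo "ok: $conf_file"
}
```

With a docker-compose installation, the check could gate the restart, for example by running `check_fixed_users_conf && docker-compose down && docker-compose up` from the **trains-server** directory.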
### Configuring the Non-Responsive Experiments Watchdog -The non-responsive experiment watchdog, monitors experiments that were not updated for a given period of time, +The non-responsive experiment watchdog, monitors experiments that were not updated for a given period of time, and marks them as `aborted`. The watchdog is always active with a default of 7200 seconds (2 hours) of inactivity threshold. To change the watchdog's timeouts, place a `services.conf` file under `/opt/trains/config`. @@ -165,7 +166,7 @@ Sample watchdog configuration file `/opt/trains/config/services.conf`: non_responsive_tasks_watchdog { # In-progress tasks that haven't been updated for at least 'value' seconds will be stopped by the watchdog threshold_sec: 7200 - + # Watchdog will sleep for this number of seconds after each cycle watch_interval_sec: 900 } @@ -181,38 +182,38 @@ To restart the **trains-server**, you must first stop and remove the containers, $ docker-compose down $ docker-compose up - + 1. Manually restarting dockers [instructions](docs/docker_setup.md#launch). ## Configuring **TRAINS** client -Once you have installed the **trains-server**, make sure to configure **TRAINS** [client](https://github.com/allegroai/trains) +Once you have installed the **trains-server**, make sure to configure **TRAINS** [client](https://github.com/allegroai/trains) to use your locally installed server (and not the demo server). 
-- Run the `trains-init` command for an interactive setup +- Run the `trains-init` command for an interactive setup - Or manually edit `~/trains.conf` file, making sure the `api_server` value is configured correctly, for example: api { # API server on port 8008 api_server: "http://localhost:8008" - + # web_server on port 8080 web_server: "http://localhost:8080" - + # file server on port 8081 files_server: "http://localhost:8081" } -* Notice that if you setup **trains-server** in a sub-domain configuration, there is no need to specify a port number, +* Notice that if you setup **trains-server** in a sub-domain configuration, there is no need to specify a port number, it will be inferred from the http/s scheme. See [Installing and Configuring TRAINS](https://github.com/allegroai/trains#configuration) for more details. ## What next? -Now that the **trains-server** is installed, and TRAINS is configured to use it, -you can [use](https://github.com/allegroai/trains#using-trains) TRAINS in your experiments and view them in the web server, +Now that the **trains-server** is installed, and TRAINS is configured to use it, +you can [use](https://github.com/allegroai/trains#using-trains) TRAINS in your experiments and view them in the web server, for example http://localhost:8080 ## Upgrading @@ -221,15 +222,29 @@ We are constantly updating, improving and adding to the **trains-server**. New releases will include new pre-built Docker images. When we release a new version and include a new pre-built Docker image for it, upgrade as follows: -1. 
Shut down and remove each of your Docker instances using the following commands: +* Upgrading your docker-compose installation - * Using Docker-Compose - - ```bash - $ docker-compose down - ``` + * Shut down the docker containers + ```bash + $ docker-compose down + ``` + + * We highly recommend backing up your data directory before upgrading + (see **Step ii** in the Manual Docker upgrade) - * Manual Docker launching + * Spin up the docker containers, it will automatically pull the latest trains-server build + ```bash + $ docker-compose up + ``` + + * In case of a docker error: "... The container name "/trains-???" is already in use by ..." + Try removing deprecated images with: + ```bash + $ docker rm -f $(docker ps -a -q) + ``` + +* Manual Docker upgrade + 1. Shut down and remove each of your Docker instances using the following commands: ```bash $ sudo docker stop @@ -240,37 +255,39 @@ When we release a new version and include a new pre-built Docker image for it, u * `trains-elastic` * `trains-mongo` + * `trains-redis` * `trains-fileserver` * `trains-apiserver` * `trains-webserver` - -2. We highly recommend backing up your data directory!. A simple way to do that is using `tar`: - - For example, if your data directory is `/opt/trains`, use the following command: - - ```bash - $ sudo tar czvf ~/trains_backup.tgz /opt/trains/data - ``` - This backups all data to an archive in your home directory. - - To restore this example backup, use the following command: - ```bash - $ sudo rm -R /opt/trains/data - $ sudo tar -xzf ~/trains_backup.tgz -C /opt/trains/data - ``` -3. Pull the new **trains-server** docker image using the following command: - - ```bash - $ sudo docker pull allegroai/trains:latest - ``` + 2. We highly recommend backing up your data directory!. 
A simple way to do that is using `tar`: - If you wish to pull a different version, replace `latest` with the required version number, for example: - ```bash - $ sudo docker pull allegroai/trains:0.10.1 - ``` - -4. Launch the newly released Docker image (see [Launching Docker Containers](#launch-docker)). + For example, if your data directory is `/opt/trains`, use the following command: + + ```bash + $ sudo tar czvf ~/trains_backup.tgz /opt/trains/data + ``` + This backups all data to an archive in your home directory. + + To restore this example backup, use the following command: + ```bash + $ sudo rm -R /opt/trains/data + $ sudo tar -xzf ~/trains_backup.tgz -C /opt/trains/data + ``` + + 3. Pull the new **trains-server** docker image using the following command: + + ```bash + $ sudo docker pull allegroai/trains:latest + ``` + + If you wish to pull a different version, replace `latest` with the required version number, for example: + ```bash + $ sudo docker pull allegroai/trains:0.11.0 + ``` + + 4. Launch the newly released Docker image (see [Launching Docker Containers](#launch-docker)). + ## Community & Support @@ -286,9 +303,9 @@ Additionally, you can always find us at *trains@allegro.ai* [Server Side Public License v1.0](https://github.com/mongodb/mongo/blob/master/LICENSE-Community.txt) **trains-server** relies on both [MongoDB](https://github.com/mongodb/mongo) and [ElasticSearch](https://github.com/elastic/elasticsearch). -With the recent changes in both MongoDB's and ElasticSearch's OSS license, we feel it is our responsibility as a +With the recent changes in both MongoDB's and ElasticSearch's OSS license, we feel it is our responsibility as a member of the community to support the projects we love and cherish. 
-We believe the cause for the license change in both cases is more than just, +We believe the cause for the license change in both cases is more than just, and chose [SSPL](https://www.mongodb.com/licensing/server-side-public-license) because it is the more general and flexible of the two licenses. This is our way to say - we support you guys! diff --git a/docs/docker_setup.md b/docs/docker_setup.md index f77311b..917dfae 100644 --- a/docs/docker_setup.md +++ b/docs/docker_setup.md @@ -1,6 +1,6 @@ # TRAINS-server: Using Docker Pre-Built Images -The pre-built Docker image for the **trains-server** is the quickest way to get started with your own **TRAINS** server. +The pre-built Docker image for the **trains-server** is the quickest way to get started with your own **TRAINS** server. You can also build the entire **trains-server** architecture using the code available in the [trains-server](https://github.com/allegroai/trains-server) repository. @@ -61,6 +61,7 @@ For example, if your data directory is `/opt/trains`, then use the following com sudo mkdir -p /opt/trains/data/elastic sudo mkdir -p /opt/trains/data/mongo/db sudo mkdir -p /opt/trains/data/mongo/configdb + sudo mkdir -p /opt/trains/data/redis sudo mkdir -p /opt/trains/logs sudo mkdir -p /opt/trains/data/fileserver sudo mkdir -p /opt/trains/config @@ -78,24 +79,28 @@ If your data directory is not `/opt/trains`, then in the five `docker run` comma sudo docker run -d --restart="always" --name="trains-elastic" -e "bootstrap.memory_lock=true" --ulimit memlock=-1:-1 -e "ES_JAVA_OPTS=-Xms2g -Xmx2g" -e "bootstrap.memory_lock=true" -e "cluster.name=trains" -e "discovery.zen.minimum_master_nodes=1" -e "node.name=trains" -e "script.inline=true" -e "script.update=true" -e "thread_pool.bulk.queue_size=2000" -e "thread_pool.search.queue_size=10000" -e "xpack.security.enabled=false" -e "xpack.monitoring.enabled=false" -e "cluster.routing.allocation.node_initial_primaries_recoveries=500" -e "node.ingest=true" -e 
"http.compression_level=7" -e "reindex.remote.whitelist=*.*" -e "script.painless.regex.enabled=true" --network="host" -v /opt/trains/data/elastic:/usr/share/elasticsearch/data docker.elastic.co/elasticsearch/elasticsearch:5.6.16 -1. Launch the **trains-mongo** Docker container. +1. Launch the **trains-mongo** Docker container. sudo docker run -d --restart="always" --name="trains-mongo" -v /opt/trains/data/mongo/db:/data/db -v /opt/trains/data/mongo/configdb:/data/configdb --network="host" mongo:3.6.5 -1. Launch the **trains-fileserver** Docker container. +1. Launch the **trains-redis** Docker container. + + sudo docker run -d --restart="always" --name="trains-redis" -v /opt/trains/data/redis:/data --network="host" redis:5.0 + +1. Launch the **trains-fileserver** Docker container. sudo docker run -d --restart="always" --name="trains-fileserver" --network="host" -v /opt/trains/logs:/var/log/trains -v /opt/trains/data/fileserver:/mnt/fileserver allegroai/trains:latest fileserver -1. Launch the **trains-apiserver** Docker container. +1. Launch the **trains-apiserver** Docker container. sudo docker run -d --restart="always" --name="trains-apiserver" --network="host" -v /opt/trains/logs:/var/log/trains -v /opt/trains/config:/opt/trains/config allegroai/trains:latest apiserver -1. Launch the **trains-webserver** Docker container. +1. Launch the **trains-webserver** Docker container. sudo docker run -d --restart="always" --name="trains-webserver" -p 8080:80 allegroai/trains:latest webserver - + 1. 
Your server is now running on [http://localhost:8080](http://localhost:8080) and the following ports are available: - + * API server on port `8008` * Web server on port `8080` * File server on port `8081` diff --git a/docs/install_aws.md b/docs/install_aws.md index a426ccf..d8ba7ba 100644 --- a/docs/install_aws.md +++ b/docs/install_aws.md @@ -21,6 +21,27 @@ The minimum recommended instance type is **t3a.large** In order to upgrade **trains-server** on an existing EC2 instance based on one of these AMIs, SSH into the instance and follow the [upgrade instructions](../README.md#upgrade) for **trains-server**. +### Upgrading AMI's to v0.12 +**Including the automatically updated AMI** + +Version 0.12 introduced an additional REDIS docker to the trains-server setup. + +AMI upgrading instructions: + +1. SSH to the EC2 machine running one of the `Latest Version AMI's` +2. Execute the following bash commands + ```bash + sudo bash + echo "" >> /usr/bin/start_or_update_server.sh + echo "sudo mkdir -p \${datadir}/redis" >> /usr/bin/start_or_update_server.sh + echo "sudo docker stop trains-redis || true && sudo docker rm -v trains-redis || true" >> /usr/bin/start_or_update_server.sh + echo "echo never | sudo tee -a /sys/kernel/mm/transparent_hugepage/enabled" >> /usr/bin/start_or_update_server.sh + echo "sudo sysctl vm.overcommit_memory=1" >> /usr/bin/start_or_update_server.sh + echo "sudo docker run -d --restart=always --name=trains-redis -v \${datadir}/redis:/data --network=host redis:5 redis-server" >> /usr/bin/start_or_update_server.sh + ``` +3. Reboot the EC2 machine + + ## Released versions The following sections provide a list containing AMI Image ID per region for each released **trains-server** version. 
@@ -28,22 +49,40 @@ The following sections provide a list containing AMI Image ID per region for eac ### Latest Version AMI **For easier upgrades: The following AMI automatically update to the latest release every reboot** -* **eu-north-1** : ami-05d0d39ba39c93781 -* **ap-south-1** : ami-01ae99e1c27e0490a -* **eu-west-3** : ami-01b156f8c7dd38121 -* **eu-west-2** : ami-01b80d5a23b8847fb -* **eu-west-1** : ami-0524891495168c944 -* **ap-northeast-2** : ami-0594f00619bea922f -* **ap-northeast-1** : ami-0d97b860be6f71a9f -* **sa-east-1** : ami-0b0889651918730b8 -* **ca-central-1** : ami-040c641b2f71082b1 -* **ap-southeast-1** : ami-00a57be01d39ff964 -* **ap-southeast-2** : ami-066dcf2cc155b6ec1 -* **eu-central-1** : ami-0bb64c4bdecebc0a9 -* **us-east-2** : ami-04addd0766ebb8f46 -* **us-west-1** : ami-0ea895789568bb537 -* **us-west-2** : ami-07ae3d0dedfdb2278 -* **us-east-1** : ami-07fe3993427800995 +* **eu-north-1** : ami-072aef14041e70651 +* **ap-south-1** : ami-08032d881daca4de1 +* **eu-west-3** : ami-0b39c123d4343d408 +* **eu-west-2** : ami-0e0fe6fd14b2e9029 +* **eu-west-1** : ami-087c81e06d722e938 +* **ap-northeast-2** : ami-0caf74f03322b994c +* **ap-northeast-1** : ami-0f723b3d49c0f2749 +* **sa-east-1** : ami-0ac5595ad0e106502 +* **ca-central-1** : ami-053049b463869469a +* **ap-southeast-1** : ami-0b440ec389d6ff541 +* **ap-southeast-2** : ami-02af978ddc2c15b71 +* **eu-central-1** : ami-09ef364aa8df29760 +* **us-east-2** : ami-02e33f8ab77071509 +* **us-west-1** : ami-0ff33f256907fd460 +* **us-west-2** : ami-0387728fb09c8cda7 +* **us-east-1** : ami-02c47c5233eed7f88 + +### v0.12.0 +* **eu-north-1** : ami-0ebb4bb8637d0da65 +* **ap-south-1** : ami-0fb3c89eb8a8fc294 +* **eu-west-3** : ami-0b55ea4a6698d5875 +* **eu-west-2** : ami-02979b6d77856b842 +* **eu-west-1** : ami-07f4c17a636489574 +* **ap-northeast-2** : ami-06071092427dd5ab4 +* **ap-northeast-1** : ami-0fbacddfc0e8d2651 +* **sa-east-1** : ami-073590d3b3e6f4cfd +* **ca-central-1** : ami-0839610fc0101e41c +* 
**ap-southeast-1** : ami-0ff0adeef7f9fa879 +* **ap-southeast-2** : ami-03ed15d31bfc2844c +* **eu-central-1** : ami-0813c06d8b2462c39 +* **us-east-2** : ami-07c593425f988b054 +* **us-west-1** : ami-0eb0e13b1f06c03c0 +* **us-west-2** : ami-000568ca142798412 +* **us-east-1** : ami-062d9da44f96c8a87 ### v0.11.0 * **eu-north-1** : ami-0cbe338f058018c97