Update AWS images 0.12.1

Add resource monitoring
Optimize thread processing
2025-06-26 23:15:47 +00:00 · 2019-12-14 23:46:21 +02:00 · 2019-12-14 23:35:42 +02:00 · 2019-12-14 23:35:18 +02:00 · 2019-12-14 23:34:00 +02:00 · 2019-12-14 23:33:36 +02:00
31 changed files with 1140 additions and 241 deletions
--- a/README.md
+++ b/README.md
@@ -51,70 +51,84 @@ Use one of our pre-installed Amazon Machine Images for easy deployment in AWS.

 For details and instructions, see [TRAINS-server: AWS pre-installed images](docs/install_aws.md).

-## Docker Installation - Linux, Mac OS X <a name="installation"></a>
+## Docker Installation - Linux, macOS, and Windows <a name="installation"></a>

-Use our pre-built Docker image for easy deployment in Linux and Mac OS X.
-For Windows, we recommend installing our pre-built Docker image on a Linux virtual machine.
+Use our pre-built Docker image for easy deployment in Linux and macOS. <br>
+For [Windows](https://github.com/allegroai/trains-server/blob/master/docs/faq.md#docker_compose_win10), please see detailed docker-compose installation instructions on our [FAQ](https://github.com/allegroai/trains-server/blob/master/docs/faq.md#docker_compose_win10).<br>
 Latest docker images can be found [here](https://hub.docker.com/r/allegroai/trains).

-1. Setup Docker ([docker-compose Ubuntu](docs/faq.md#ubuntu), [docker-compose OS X](docs/faq.md#mac-osx), [Setup Docker Service Manually](docs/docker_setup.md#setup-docker))
+1. Setup Docker (docker-compose installation details: [Ubuntu](docs/faq.md#ubuntu) / [macOS](docs/faq.md#mac-osx))

-    Make sure port 8080/8081/8008 are available for the `trains-server` services
-
-    Increase vm.max_map_count for `ElasticSearch` docker
+    <details>
+    <summary>Make sure ports 8080/8081/8008 are available for the TRAINS-server services:</summary>
+   
+    For example, to see if port `8080` is in use: 

    ```bash
-    echo "vm.max_map_count=262144" > /tmp/99-trains.conf
-    sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf
-    sudo sysctl -w vm.max_map_count=262144
-
-    sudo service docker restart
+    $ sudo lsof -Pn -i4 | grep :8080 | grep LISTEN
    ```
+    
+    </details>
+    
+    Increase vm.max_map_count for `ElasticSearch` docker
+
+    - Linux
+        ```bash
+        $ echo "vm.max_map_count=262144" > /tmp/99-trains.conf
+        $ sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf
+        $ sudo sysctl -w vm.max_map_count=262144
+        $ sudo service docker restart
+        ```
+      
+    - macOS
+        ```bash
+        $ screen ~/Library/Containers/com.docker.docker/Data/vms/0/tty
+        $ sysctl -w vm.max_map_count=262144
+        ```    

 1. Create local directories for the databases and storage.

    ```bash
-    sudo mkdir -p /opt/trains/data/elastic
-    sudo mkdir -p /opt/trains/data/mongo/db
-    sudo mkdir -p /opt/trains/data/mongo/configdb
-    sudo mkdir -p /opt/trains/data/redis
-    sudo mkdir -p /opt/trains/logs
-    sudo mkdir -p /opt/trains/data/fileserver
-    sudo mkdir -p /opt/trains/config
+    $ sudo mkdir -p /opt/trains/data/elastic
+    $ sudo mkdir -p /opt/trains/data/mongo/db
+    $ sudo mkdir -p /opt/trains/data/mongo/configdb
+    $ sudo mkdir -p /opt/trains/data/redis
+    $ sudo mkdir -p /opt/trains/logs
+    $ sudo mkdir -p /opt/trains/data/fileserver
+    $ sudo mkdir -p /opt/trains/config
    ```

-    Linux
-    ```bash
-    $ sudo chown -R 1000:1000 /opt/trains
-    ```
-    Mac OS X
-    ```bash
-    $ sudo chown -R $(whoami):staff /opt/trains
-    ```
+    Set folder permissions
+      
+    - Linux
+      ```bash
+      $ sudo chown -R 1000:1000 /opt/trains
+      ```
+    - macOS
+      ```bash
+      $ sudo chown -R $(whoami):staff /opt/trains
+      ```

-1. Clone the [trains-server](https://github.com/allegroai/trains-server) repository and change directories to the new **trains-server** directory.
+1. Download the `docker-compose.yml` file, either download [manually](https://raw.githubusercontent.com/allegroai/trains-server/master/docker-compose.yml) or execute:

    ```bash
-    $ git clone https://github.com/allegroai/trains-server.git
-    $ cd trains-server
+    $ curl https://raw.githubusercontent.com/allegroai/trains-server/master/docker-compose.yml -o docker-compose.yml 
    ```

 1. Launch the Docker containers <a name="launch-docker"></a>

-    * Automatically with docker-compose (details: [Linux/Ubuntu](docs/faq.md#ubuntu), [OS X](docs/faq.md#mac-osx))
-
    ```bash
-    $ docker-compose up
+    $ docker-compose -f docker-compose.yml up
    ```

-    * Manually, see [Launching Docker Containers Manually](docs/docker_setup.md#launch) for instructions.
-
 1. Your server is now running on [http://localhost:8080](http://localhost:8080) and the following ports are available:

    * Web server on port `8080`
    * API server on port `8008`
    * File server on port `8081`

+**\* If something went wrong along the way, check our FAQ: [Docker Setup](docs/docker_setup.md#setup-docker), [Ubuntu Support](docs/faq.md#ubuntu), [macOS Support](docs/faq.md#mac-osx)**
+
 ## Optional Configuration

 The **trains-server** default configuration can be easily overridden using external configuration files. By default, the server will look for these files in `/opt/trains/config`.
@@ -128,30 +142,9 @@ You can configure the **trains-server** to allow only a specific set of users to

 Enable this feature by placing `apiserver.conf` file under `/opt/trains/config`.

+Sample `apiserver.conf` configuration file can be found [here](https://github.com/allegroai/trains-server/blob/master/docs/apiserver.conf)

-Sample fixed user configuration file `/opt/trains/config/apiserver.conf`:
-
-    auth {
-        # Fixed users login credetials
-        # No other user will be able to login
-        fixed_users {
-            enabled: true
-            users: [
-                {
-                    username: "jane"
-                    password: "12345678"
-                    name: "Jane Doe"
-                },
-                {
-                    username: "john"
-                    password: "12345678"
-                    name: "John Doe"
-                },
-            ]
-        }
-    }
-
-To apply the `apiserver.conf` changes, you must restart the *trains-apiserver* (docker) (see [Restarting trains-server](#restart-server)).
+To apply the changes, you must [restart the *trains-server*](#restart-server).

 ### Configuring the Non-Responsive Experiments Watchdog

@@ -160,30 +153,18 @@ and marks them as `aborted`. The watchdog is always active with a default of 720

 To change the watchdog's timeouts, place a `services.conf` file under `/opt/trains/config`.

-Sample watchdog configuration file `/opt/trains/config/services.conf`:
+Sample watchdog `services.conf` configuration file can be found [here](https://github.com/allegroai/trains-server/blob/master/docs/services.conf)

-    tasks {
-        non_responsive_tasks_watchdog {
-            # In-progress tasks that haven't been updated for at least 'value' seconds will be stopped by the watchdog
-            threshold_sec: 7200
-
-            # Watchdog will sleep for this number of seconds after each cycle
-            watch_interval_sec: 900
-        }
-    }
-
-To apply the `services.conf` changes, you must restart the *trains-apiserver* (docker) (see [Restarting trains-server](#restart-server)).
+To apply the changes, you must [restart the *trains-server*](#restart-server).

 ### Restarting trains-server <a name="restart-server"></a>

-To restart the **trains-server**, you must first stop and remove the containers, and then restart.
+To restart the **trains-server**, you must first stop the containers, and then restart them.
+   ```bash
+   $ docker-compose down
+   $ docker-compose -f docker-compose.yml up
+   ```

-1. Restarting docker-compose containers.
-
-        $ docker-compose down
-        $ docker-compose up
-
-1. Manually restarting dockers [instructions](docs/docker_setup.md#launch).

 ## Configuring **TRAINS** client

@@ -222,71 +203,42 @@ We are constantly updating, improving and adding to the **trains-server**.
 New releases will include new pre-built Docker images.
 When we release a new version and include a new pre-built Docker image for it, upgrade as follows:

-* Upgrading your docker-compose installation
+1. Shut down the docker containers
+   ```bash
+   $ docker-compose down
+   ```

-    * Shut down the docker containers
-    ```bash
-    $ docker-compose down
-    ```
-  
-    * We highly recommend backing up your data directory before upgrading     
-    (see **Step ii** in the Manual Docker upgrade)
+1. We highly recommend backing up your data directory before upgrading.

-    * Spin up the docker containers, it will automatically pull the latest trains-server build    
-    ```bash
-    $ docker-compose up
-    ```
+   Assuming your data directory is `/opt/trains`, to archive all data into `~/trains_backup.tgz` execute:

-    * In case of a docker error: "... The container name "/trains-???" is already in use by ..."      
-      Try removing deprecated images with:
-      ```bash
-      $ docker rm -f $(docker ps -a -q)
-      ```
+   ```bash
+   $ sudo tar czvf ~/trains_backup.tgz /opt/trains/data
+   ```    

-* Manual Docker upgrade
-    1. Shut down and remove each of your Docker instances using the following commands:
-    
-        ```bash
-        $ sudo docker stop <docker-name>
-        $ sudo docker rm -v <docker-name>
-        ```
-    
-        The Docker names are (see [Launching Docker Containers](#launch-docker)):
-    
-        * `trains-elastic`
-        * `trains-mongo`
-        * `trains-redis`
-        * `trains-fileserver`
-        * `trains-apiserver`
-        * `trains-webserver`
-    
-    2. We highly recommend backing up your data directory!. A simple way to do that is using `tar`:
-    
-        For example, if your data directory is `/opt/trains`, use the following command:
-    
-        ```bash
-        $ sudo tar czvf ~/trains_backup.tgz /opt/trains/data
-        ```
-        This backups all data to an archive in your home directory.
-    
-        To restore this example backup, use the following command:
-        ```bash
-        $ sudo rm -R /opt/trains/data
-        $ sudo tar -xzf ~/trains_backup.tgz -C /opt/trains/data
-        ```
-    
-    3. Pull the new **trains-server** docker image using the following command:
-    
-        ```bash
-        $ sudo docker pull allegroai/trains:latest
-        ```
-    
-        If you wish to pull a different version, replace `latest` with the required version number, for example:
-        ```bash
-        $ sudo docker pull allegroai/trains:0.11.0
-         ```
-    
-    4. Launch the newly released Docker image (see [Launching Docker Containers](#launch-docker)).
+   <details>
+   <summary>Restore instructions:</summary>
+
+   To restore this example backup, execute:
+   ```bash
+   $ sudo rm -R /opt/trains/data
+   $ sudo tar -xzf ~/trains_backup.tgz -C /opt/trains/data
+   ```
+   </details>
+
+1. Download the latest `docker-compose.yml` file, either [manually](https://raw.githubusercontent.com/allegroai/trains-server/master/docker-compose.yml) or execute:
+
+   ```bash
+   $ curl https://raw.githubusercontent.com/allegroai/trains-server/master/docker-compose.yml -o docker-compose.yml 
+   ```
+
+1. Spin up the docker containers, it will automatically pull the latest trains-server build    
+   ```bash
+   $ docker-compose -f docker-compose.yml pull
+   $ docker-compose -f docker-compose.yml up
+   ```
+
+**\* If something went wrong along the way, check our FAQ: [Docker Upgrade](docs/docker_setup.md#common-docker-upgrade-errors)**


 ## Community & Support
--- a/docker-compose-win10.yml
+++ b/docker-compose-win10.yml
@@ -0,0 +1,117 @@
+version: "3.6"
+services:
+
+  apiserver:
+    command:
+    - apiserver
+    container_name: trains-apiserver
+    image: allegroai/trains:latest
+    restart: unless-stopped
+    volumes:
+    - c:/opt/trains/logs:/var/log/trains
+    - c:/opt/trains/config:/opt/trains/config
+    depends_on:
+      - redis
+      - mongo
+      - elasticsearch
+      - fileserver
+    environment:
+      ELASTIC_SERVICE_HOST: elasticsearch
+      MONGODB_SERVICE_HOST: mongo
+      REDIS_SERVICE_HOST: redis
+    ports:
+    - "8008:8008"
+    networks:
+      - backend
+
+  elasticsearch:
+    networks:
+      - backend
+    container_name: trains-elastic
+    environment:
+      ES_JAVA_OPTS: -Xms2g -Xmx2g
+      bootstrap.memory_lock: "true"
+      cluster.name: trains
+      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
+      discovery.zen.minimum_master_nodes: "1"
+      http.compression_level: "7"
+      node.ingest: "true"
+      node.name: trains
+      reindex.remote.whitelist: '*.*'
+      script.inline: "true"
+      script.painless.regex.enabled: "true"
+      script.update: "true"
+      thread_pool.bulk.queue_size: "2000"
+      thread_pool.search.queue_size: "10000"
+      xpack.monitoring.enabled: "false"
+      xpack.security.enabled: "false"
+    ulimits:
+      memlock:
+        soft: -1
+        hard: -1
+      nofile:
+        soft: 65536
+        hard: 65536
+    image: docker.elastic.co/elasticsearch/elasticsearch:5.6.16
+    restart: unless-stopped
+    volumes:
+    - c:/opt/trains/data/elastic:/usr/share/elasticsearch/data
+    ports:
+    - "9200:9200"
+
+  fileserver:
+    networks:
+      - backend
+    command:
+    - fileserver
+    container_name: trains-fileserver
+    image: allegroai/trains:latest
+    restart: unless-stopped
+    volumes:
+    - c:/opt/trains/logs:/var/log/trains
+    - c:/opt/trains/data/fileserver:/mnt/fileserver
+    ports:
+    - "8081:8081"
+
+  mongo:
+    networks:
+      - backend
+    container_name: trains-mongo
+    image: mongo:3.6.5
+    restart: unless-stopped
+    command: --setParameter internalQueryExecMaxBlockingSortBytes=196100200
+    volumes:
+    - mongodata:/data
+    ports:
+    - "27017:27017"
+
+  redis:
+    networks:
+      - backend
+    container_name: trains-redis
+    image: redis:5.0
+    restart: unless-stopped
+    volumes:
+    - c:/opt/trains/data/redis:/data
+    ports:
+    - "6379:6379"
+
+  webserver:
+    command:
+    - webserver
+    container_name: trains-webserver
+    image: allegroai/trains:latest
+    restart: unless-stopped
+    volumes:
+    - c:/trains/logs:/var/log/trains
+    depends_on:
+      - apiserver
+    ports:
+    - "8080:80"
+
+networks:
+  backend:
+    driver: bridge
+
+volumes:
+  mongodata:
--- a/docs/apiserver.conf
+++ b/docs/apiserver.conf
@@ -0,0 +1,19 @@
+auth {
+    # Fixed users login credetials
+    # No other user will be able to login
+    fixed_users {
+        enabled: true
+        users: [
+            {
+                username: "jane"
+                password: "12345678"
+                name: "Jane Doe"
+            },
+            {
+                username: "john"
+                password: "12345678"
+                name: "John Doe"
+            },
+        ]
+    }
+}
--- a/docs/docker_setup.md
+++ b/docs/docker_setup.md
@@ -58,15 +58,15 @@ Create this directory, and set its owner and group to `uid` 1000. The data store
 For example, if your data directory is `/opt/trains`, then use the following command:

 ```bash
-    sudo mkdir -p /opt/trains/data/elastic
-    sudo mkdir -p /opt/trains/data/mongo/db
-    sudo mkdir -p /opt/trains/data/mongo/configdb
-    sudo mkdir -p /opt/trains/data/redis
-    sudo mkdir -p /opt/trains/logs
-    sudo mkdir -p /opt/trains/data/fileserver
-    sudo mkdir -p /opt/trains/config
+sudo mkdir -p /opt/trains/data/elastic
+sudo mkdir -p /opt/trains/data/mongo/db
+sudo mkdir -p /opt/trains/data/mongo/configdb
+sudo mkdir -p /opt/trains/data/redis
+sudo mkdir -p /opt/trains/logs
+sudo mkdir -p /opt/trains/data/fileserver
+sudo mkdir -p /opt/trains/config

-    sudo chown -R 1000:1000 /opt/trains
+sudo chown -R 1000:1000 /opt/trains
 ```

 ## TRAINS-server: Manually Launching Docker Containers <a name="launch"></a>
@@ -104,3 +104,63 @@ If your data directory is not `/opt/trains`, then in the five `docker run` comma
    * API server on port `8008`
    * Web server on port `8080`
    * File server on port `8081`
+
+## Manually Upgrading TRAINS-server Containers <a name="upgrade"></a>
+
+We are constantly updating, improving and adding to the **trains-server**.
+New releases will include new pre-built Docker images.
+When we release a new version and include a new pre-built Docker image for it, upgrade as follows:
+
+1. Shut down and remove each of your Docker instances using the following commands:
+    
+    ```bash
+    $ sudo docker stop <docker-name>
+    $ sudo docker rm -v <docker-name>
+    ```
+
+    The Docker names are (see [Launching Docker Containers](#launch-docker)):
+
+    * `trains-elastic`
+    * `trains-mongo`
+    * `trains-redis`
+    * `trains-fileserver`
+    * `trains-apiserver`
+    * `trains-webserver`
+
+2. We highly recommend backing up your data directory!. A simple way to do that is using `tar`:
+
+    For example, if your data directory is `/opt/trains`, use the following command:
+
+    ```bash
+    $ sudo tar czvf ~/trains_backup.tgz /opt/trains/data
+    ```
+    This backups all data to an archive in your home directory.
+
+    To restore this example backup, use the following command:
+    ```bash
+    $ sudo rm -R /opt/trains/data
+    $ sudo tar -xzf ~/trains_backup.tgz -C /opt/trains/data
+    ```
+
+3. Pull the new **trains-server** docker image using the following command:
+
+    ```bash
+    $ sudo docker pull allegroai/trains:latest
+    ```
+
+    If you wish to pull a different version, replace `latest` with the required version number, for example:
+    ```bash
+    $ sudo docker pull allegroai/trains:0.11.0
+     ```
+
+4. Launch the newly released Docker image (see [Launching Docker Containers](#trains-server-manually-launching-docker-containers-)).
+
+
+#### Common Docker Upgrade Errors 
+
+* In case of a docker error: "... The container name "/trains-???" is already in use by ..."      
+    Try removing deprecated images with:
+    ```bash
+    $ docker rm -f $(docker ps -a -q)
+    ```
+  
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -6,12 +6,15 @@

 * [Running trains-server on Mac OS X](#mac-osx)

+* [Running trains-server on Windows 10](#docker_compose_win10)
+
 * [Installing trains-server on stand alone Linux Ubuntu  systems ](#ubuntu)

 * [Resolving port conflicts preventing fixed users mode authentication and login](#port-conflict)

 * [Configuring trains-server for sub-domains and load balancers](#sub-domains)

+
 ### Deploying trains-server on Kubernetes clusters <a name="kubernetes"></a>

 **trains-server** supports Kubernetes. See [trains-server-k8s](https://github.com/allegroai/trains-server-k8s)
@@ -33,13 +36,14 @@ To install and configure **trains-server** on Mac OS X, follow the steps below.
 1. Configure [Docker](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#docker-cli-run-prod-mode).

        $ screen ~/Library/Containers/com.docker.docker/Data/vms/0/tty
-        sysctl -w vm.max_map_count=262144
+        $ sysctl -w vm.max_map_count=262144

 1. Create local directories for the databases and storage.

        $ sudo mkdir -p /opt/trains/data/elastic
        $ sudo mkdir -p /opt/trains/data/mongo/db
        $ sudo mkdir -p /opt/trains/data/mongo/configdb
+        $ sudo mkdir -p /opt/trains/data/redis
        $ sudo mkdir -p /opt/trains/logs
        $ sudo mkdir -p /opt/trains/config
        $ sudo mkdir -p /opt/trains/data/fileserver
@@ -58,6 +62,43 @@ To install and configure **trains-server** on Mac OS X, follow the steps below.

    Your server is now running on [http://localhost:8080](http://localhost:8080)

+### Running trains-server on Windows 10 <a name="docker_compose_win10"></a>
+
+You can run **trains-server** on Windows 10 using Docker Desktop for Windows (see the Docker [System Requirements](https://docs.docker.com/docker-for-windows/install/#system-requirements)).
+
+To run **trains-server** on Windows 10, follow the steps below.
+
+1. Install the Docker Desktop for Windows application by either:
+
+    * Following the [Install Docker Desktop on Windows](https://docs.docker.com/docker-for-windows/install/) instructions.
+    * Running the Docker installation [wizard](https://hub.docker.com/?overlay=onboarding).
+
+1. Increase the memory allocation in Docker Desktop to `4GB`.
+
+    1. In your Windows notification area (system tray), right click the Docker icon.
+    
+    1. Click *Settings*, *Advanced*, and then set the memory to at least `4096`. 
+    
+    1. Click *Apply*.
+
+1. Create local directories for data and logs. Open PowerShell and execute the following commands:
+
+        mkdir c:\opt\trains\logs
+        mkdir c:\opt\trains\config
+        mkdir c:\opt\trains\data
+        mkdir c:\opt\trains\data\elastic
+        mkdir c:\opt\trains\data\redis
+        mkdir c:\opt\trains\data\fileserver
+
+1. Save the **trains-server** docker-compose YAML file [docker-compose-win10.yml](https://raw.githubusercontent.com/allegroai/trains-server/master/docker-compose-win10.yml) as `c:\opt\trains\docker-compose.yml`.
+
+1. Run `docker-compose`. In PowerShell, execute the following commands:
+
+        cd c:\opt\trains\
+        docker-compose up
+
+    Your server is now running on [http://localhost:8080](http://localhost:8080)
+
 ### Installing trains-server on stand alone Linux Ubuntu systems <a name="ubuntu"></a>

 To install **trains-server** on a stand alone Linux Ubuntu, follow the steps belows.
--- a/docs/install_aws.md
+++ b/docs/install_aws.md
@@ -49,40 +49,58 @@ The following sections provide a list containing AMI Image ID per region for eac
 ### Latest Version AMI <a name="autoupdate"></a>
 **For easier upgrades: The following AMI automatically update to the latest release every reboot**

-* **eu-north-1** : ami-072aef14041e70651
-* **ap-south-1** : ami-08032d881daca4de1
-* **eu-west-3** : ami-0b39c123d4343d408
-* **eu-west-2** : ami-0e0fe6fd14b2e9029
-* **eu-west-1** : ami-087c81e06d722e938
-* **ap-northeast-2** : ami-0caf74f03322b994c
-* **ap-northeast-1** : ami-0f723b3d49c0f2749
-* **sa-east-1** : ami-0ac5595ad0e106502
-* **ca-central-1** : ami-053049b463869469a
-* **ap-southeast-1** : ami-0b440ec389d6ff541
-* **ap-southeast-2** : ami-02af978ddc2c15b71
-* **eu-central-1** : ami-09ef364aa8df29760
-* **us-east-2** : ami-02e33f8ab77071509
-* **us-west-1** : ami-0ff33f256907fd460
-* **us-west-2** : ami-0387728fb09c8cda7
-* **us-east-1** : ami-02c47c5233eed7f88
+* **eu-north-1** : ami-055909c1b9471451d
+* **ap-south-1** : ami-0476123cc77226faf
+* **eu-west-3** : ami-01df7d35ab63cca70
+* **eu-west-2** : ami-00e8004c11fd0228e
+* **eu-west-1** : ami-04293fbba6d3acad1
+* **ap-northeast-2** : ami-004331f9c5eb13e94
+* **ap-northeast-1** : ami-08cc80e2049b30e61
+* **sa-east-1** : ami-06d814a0b6ffa3153
+* **ca-central-1** : ami-069210ff757e9c1b7
+* **ap-southeast-1** : ami-0d12cc70d6e9c0f39
+* **ap-southeast-2** : ami-0b4615aa76c055267
+* **eu-central-1** : ami-06537f431e52e4763
+* **us-east-2** : ami-0c3cfbcb8e72ecfc5
+* **us-west-1** : ami-0d83de031b83b6880
+* **us-west-2** : ami-06968633c4f7187c4
+* **us-east-1** : ami-07ff2f5f7ef99e8f6
+
+### v0.12.1
+* **eu-north-1** : ami-003118a8103286d84
+* **ap-south-1** : ami-02dfe86baa48e096f
+* **eu-west-3** : ami-0cc1f01267d2a780d
+* **eu-west-2** : ami-0e4c8332e5ce09585
+* **eu-west-1** : ami-03459a2f0b0a3b1ab
+* **ap-northeast-2** : ami-08f6c2aed3a53f24c
+* **ap-northeast-1** : ami-0b798eab95a7c5435
+* **sa-east-1** : ami-0d3ee166c09f0d1b2
+* **ca-central-1** : ami-00a758c56bd63acd5
+* **ap-southeast-1** : ami-0be64d4988cd03fbb
+* **ap-southeast-2** : ami-02087310d43a63f31
+* **eu-central-1** : ami-097bbefeac0c74225
+* **us-east-2** : ami-07eda256712b90f4d
+* **us-west-1** : ami-02ef2b55cbd01c7df
+* **us-west-2** : ami-037c6176ef4735360
+* **us-east-1** : ami-08715c20c0e3f1c15

 ### v0.12.0
-* **eu-north-1** : ami-0ebb4bb8637d0da65
-* **ap-south-1** : ami-0fb3c89eb8a8fc294
-* **eu-west-3** : ami-0b55ea4a6698d5875
-* **eu-west-2** : ami-02979b6d77856b842
-* **eu-west-1** : ami-07f4c17a636489574
-* **ap-northeast-2** : ami-06071092427dd5ab4
-* **ap-northeast-1** : ami-0fbacddfc0e8d2651
-* **sa-east-1** : ami-073590d3b3e6f4cfd
-* **ca-central-1** : ami-0839610fc0101e41c
-* **ap-southeast-1** : ami-0ff0adeef7f9fa879
-* **ap-southeast-2** : ami-03ed15d31bfc2844c
-* **eu-central-1** : ami-0813c06d8b2462c39
-* **us-east-2** : ami-07c593425f988b054
-* **us-west-1** : ami-0eb0e13b1f06c03c0
-* **us-west-2** : ami-000568ca142798412
-* **us-east-1** : ami-062d9da44f96c8a87
+* **eu-north-1** : ami-03ff8ab48cd43e77e
+* **ap-south-1** : ami-079c1a41ff836487c
+* **eu-west-3** : ami-0121ef0398ae87ab0
+* **eu-west-2** : ami-09f0f97654d8c79de
+* **eu-west-1** : ami-0b7ba303f757bfcd9
+* **ap-northeast-2** : ami-053f416517b5f40a6
+* **ap-northeast-1** : ami-056dff06c698c2d9d
+* **sa-east-1** : ami-017ab655119258639
+* **ca-central-1** : ami-03bf5fa1d86ac97f6
+* **ap-southeast-1** : ami-0e667958002b0360c
+* **ap-southeast-2** : ami-091f1b69cb43b1933
+* **eu-central-1** : ami-068ec2f0e98c26541
+* **us-east-2** : ami-0524bbdc1b64ff83f
+* **us-west-1** : ami-0b4facd7534e393c9
+* **us-west-2** : ami-0018d5a7e58966848
+* **us-east-1** : ami-08f24178fc14a84d2

 ### v0.11.0
 * **eu-north-1** : ami-0cbe338f058018c97
--- a/docs/services.conf
+++ b/docs/services.conf
@@ -0,0 +1,9 @@
+tasks {
+    non_responsive_tasks_watchdog {
+        # In-progress tasks that haven't been updated for at least 'value' seconds will be stopped by the watchdog
+        threshold_sec: 7200
+
+        # Watchdog will sleep for this number of seconds after each cycle
+        watch_interval_sec: 900
+    }
+}
--- a/server/apimodels/server.py
+++ b/server/apimodels/server.py
@@ -0,0 +1,14 @@
+from jsonmodels.fields import BoolField, DateTimeField, StringField
+from jsonmodels.models import Base
+
+
+class ReportStatsOptionRequest(Base):
+    enabled = BoolField(default=None, nullable=True)
+
+
+class ReportStatsOptionResponse(Base):
+    supported = BoolField(default=True)
+    enabled = BoolField()
+    enabled_time = DateTimeField(nullable=True)
+    enabled_version = StringField(nullable=True)
+    enabled_user = StringField(nullable=True)
--- a/server/bll/queue/queue_metrics.py
+++ b/server/bll/queue/queue_metrics.py
@@ -161,7 +161,7 @@ class QueueMetrics:
        In case no queue ids are specified the avg across all the
        company queues is calculated for each metric
        """
-        # self._log_current_metrics(company_id, queue_ids=queue_ids)
+        # self._log_current_metrics(company, queue_ids=queue_ids)

        if from_date >= to_date:
            raise bad_request.FieldsValueError("from_date must be less than to_date")
--- a/server/bll/statistics/resource_monitor.py
+++ b/server/bll/statistics/resource_monitor.py
@@ -0,0 +1,87 @@
+from datetime import datetime
+import operator
+from threading import Thread, Lock
+from time import sleep
+
+import attr
+import psutil
+
+
+class ResourceMonitor(Thread):
+    @attr.s(auto_attribs=True)
+    class Sample:
+        cpu_usage: float = 0.0
+        mem_used_gb: float = 0
+        mem_free_gb: float = 0
+
+        @classmethod
+        def _apply(cls, op, *samples):
+            return cls(
+                **{
+                    field: op(*(getattr(sample, field) for sample in samples))
+                    for field in attr.fields_dict(cls)
+                }
+            )
+
+        def min(self, sample):
+            return self._apply(min, self, sample)
+
+        def max(self, sample):
+            return self._apply(max, self, sample)
+
+        def avg(self, sample, count):
+            res = self._apply(lambda x: x * count, self)
+            res = self._apply(operator.add, res, sample)
+            res = self._apply(lambda x: x / (count + 1), res)
+            return res
+
+    def __init__(self, sample_interval_sec=5):
+        super(ResourceMonitor, self).__init__(daemon=True)
+        self.sample_interval_sec = sample_interval_sec
+        self._lock = Lock()
+        self._clear()
+
+    def _clear(self):
+        sample = self._get_sample()
+        self._avg = sample
+        self._min = sample
+        self._max = sample
+        self._clear_time = datetime.utcnow()
+        self._count = 1
+
+    @classmethod
+    def _get_sample(cls) -> Sample:
+        return cls.Sample(
+            cpu_usage=psutil.cpu_percent(),
+            mem_used_gb=psutil.virtual_memory().used / (1024 ** 3),
+            mem_free_gb=psutil.virtual_memory().free / (1024 ** 3),
+        )
+
+    def run(self):
+        while True:
+            sample = self._get_sample()
+
+            with self._lock:
+                self._min = self._min.min(sample)
+                self._max = self._max.max(sample)
+                self._avg = self._avg.avg(sample, self._count)
+                self._count += 1
+
+            sleep(self.sample_interval_sec)
+
+    def get_stats(self) -> dict:
+        """ Returns current resource statistics and clears internal resource statistics """
+        with self._lock:
+            min_ = attr.asdict(self._min)
+            max_ = attr.asdict(self._max)
+            avg = attr.asdict(self._avg)
+            res = {
+                "interval_sec": (datetime.utcnow() - self._clear_time).total_seconds(),
+                "num_cores": psutil.cpu_count(),
+                **{
+                    k: {"min": v, "max": max_[k], "avg": avg[k]}
+                    for k, v in min_.items()
+                }
+            }
+            self._clear()
+        return res
--- a/server/bll/statistics/stats_reporter.py
+++ b/server/bll/statistics/stats_reporter.py
@@ -0,0 +1,306 @@
+import logging
+import queue
+import random
+import time
+from datetime import timedelta, datetime
+from time import sleep
+from typing import Sequence, Optional
+
+import dpath
+import requests
+from requests.adapters import HTTPAdapter
+from requests.packages.urllib3.util.retry import Retry
+
+from bll.query import Builder as QueryBuilder
+from bll.util import get_server_uuid
+from bll.workers import WorkerStats, WorkerBLL
+from config import config
+from config.info import get_deployment_type
+from database.model import Company, User
+from database.model.queue import Queue
+from database.model.task.task import Task
+from utilities import safe_get
+from utilities.json import dumps
+from utilities.threads_manager import ThreadsManager
+from version import __version__ as current_version
+from .resource_monitor import ResourceMonitor
+
+log = config.logger(__file__)
+
+worker_bll = WorkerBLL()
+
+
+class StatisticsReporter:
+    threads = ThreadsManager("Statistics", resource_monitor=ResourceMonitor)
+    send_queue = queue.Queue()
+    supported = config.get("apiserver.statistics.supported", True)
+
+    @classmethod
+    def start(cls):
+        cls.start_sender()
+        cls.start_reporter()
+
+    @classmethod
+    @threads.register("reporter", daemon=True)
+    def start_reporter(cls):
+        """
+        Periodically send statistics reports for companies who have opted in.
+        Note: in trains we usually have only a single company
+        """
+        if not cls.supported:
+            return
+
+        report_interval = timedelta(
+            hours=config.get("apiserver.statistics.report_interval_hours", 24)
+        )
+
+        while True:
+
+            sleep(report_interval.total_seconds())
+
+            try:
+                for company in Company.objects(
+                    defaults__stats_option__enabled=True
+                ).only("id"):
+                    stats = cls.get_statistics(company.id)
+                    cls.send_queue.put(stats)
+
+            except Exception as ex:
+                log.exception(f"Failed collecting stats: {str(ex)}")
+
+    @classmethod
+    @threads.register("sender", daemon=True)
+    def start_sender(cls):
+        if not cls.supported:
+            return
+
+        url = config.get("apiserver.statistics.url")
+
+        retries = config.get("apiserver.statistics.max_retries", 5)
+        max_backoff = config.get("apiserver.statistics.max_backoff_sec", 5)
+        session = requests.Session()
+        adapter = HTTPAdapter(max_retries=Retry(retries))
+        session.mount("http://", adapter)
+        session.mount("https://", adapter)
+        session.headers["Content-type"] = "application/json"
+
+        WarningFilter.attach()
+
+        while True:
+            try:
+                report = cls.send_queue.get()
+
+                # Set a random backoff factor each time we send a report
+                adapter.max_retries.backoff_factor = random.random() * max_backoff
+
+                session.post(url, data=dumps(report))
+
+            except Exception as ex:
+                pass
+
+    @classmethod
+    def get_statistics(cls, company_id: str) -> dict:
+        """
+        Returns a statistics report per company
+        """
+        return {
+            "time": datetime.utcnow(),
+            "company_id": company_id,
+            "server": {
+                "version": current_version,
+                "deployment": get_deployment_type(),
+                "uuid": get_server_uuid(),
+                "queues": {"count": Queue.objects(company=company_id).count()},
+                "users": {"count": User.objects(company=company_id).count()},
+                "resources": cls.threads.resource_monitor.get_stats(),
+                "experiments": next(
+                    iter(cls._get_experiments_stats(company_id).values()), {}
+                ),
+            },
+            "agents": cls._get_agents_statistics(company_id),
+        }
+
+    @classmethod
+    def _get_agents_statistics(cls, company_id: str) -> Sequence[dict]:
+        result = cls._get_resource_stats_per_agent(company_id, key="resources")
+        dpath.merge(
+            result, cls._get_experiments_stats_per_agent(company_id, key="experiments")
+        )
+        return [{"uuid": agent_id, **data} for agent_id, data in result.items()]
+
+    @classmethod
+    def _get_resource_stats_per_agent(cls, company_id: str, key: str) -> dict:
+        agent_resource_threshold_sec = timedelta(
+            hours=config.get("apiserver.statistics.report_interval_hours", 24)
+        ).total_seconds()
+        to_timestamp = int(time.time())
+        from_timestamp = to_timestamp - int(agent_resource_threshold_sec)
+        es_req = {
+            "size": 0,
+            "query": QueryBuilder.dates_range(from_timestamp, to_timestamp),
+            "aggs": {
+                "workers": {
+                    "terms": {"field": "worker"},
+                    "aggs": {
+                        "categories": {
+                            "terms": {"field": "category"},
+                            "aggs": {"count": {"cardinality": {"field": "variant"}}},
+                        },
+                        "metrics": {
+                            "terms": {"field": "metric"},
+                            "aggs": {
+                                "min": {"min": {"field": "value"}},
+                                "max": {"max": {"field": "value"}},
+                                "avg": {"avg": {"field": "value"}},
+                            },
+                        },
+                    },
+                }
+            },
+        }
+        res = cls._run_worker_stats_query(company_id, es_req)
+
+        def _get_cardinality_fields(categories: Sequence[dict]) -> dict:
+            names = {"cpu": "num_cores"}
+            return {
+                names[c["key"]]: safe_get(c, "count/value")
+                for c in categories
+                if c["key"] in names
+            }
+
+        def _get_metric_fields(metrics: Sequence[dict]) -> dict:
+            names = {
+                "cpu_usage": "cpu_usage",
+                "memory_used": "mem_used_gb",
+                "memory_free": "mem_free_gb",
+            }
+            return {
+                names[m["key"]]: {
+                    "min": safe_get(m, "min/value"),
+                    "max": safe_get(m, "max/value"),
+                    "avg": safe_get(m, "avg/value"),
+                }
+                for m in metrics
+                if m["key"] in names
+            }
+
+        buckets = safe_get(res, "aggregations/workers/buckets", default=[])
+        return {
+            b["key"]: {
+                key: {
+                    "interval_sec": agent_resource_threshold_sec,
+                    **_get_cardinality_fields(safe_get(b, "categories/buckets", [])),
+                    **_get_metric_fields(safe_get(b, "metrics/buckets", [])),
+                }
+            }
+            for b in buckets
+        }
+
+    @classmethod
+    def _get_experiments_stats_per_agent(cls, company_id: str, key: str) -> dict:
+        agent_relevant_threshold = timedelta(
+            days=config.get("apiserver.statistics.agent_relevant_threshold_days", 30)
+        )
+        to_timestamp = int(time.time())
+        from_timestamp = to_timestamp - int(agent_relevant_threshold.total_seconds())
+        workers = cls._get_active_workers(company_id, from_timestamp, to_timestamp)
+        if not workers:
+            return {}
+
+        stats = cls._get_experiments_stats(company_id, list(workers.keys()))
+        return {
+            worker_id: {key: {**workers[worker_id], **stat}}
+            for worker_id, stat in stats.items()
+        }
+
+    @classmethod
+    def _get_active_workers(
+        cls, company_id, from_timestamp: int, to_timestamp: int
+    ) -> dict:
+        es_req = {
+            "size": 0,
+            "query": QueryBuilder.dates_range(from_timestamp, to_timestamp),
+            "aggs": {
+                "workers": {
+                    "terms": {"field": "worker"},
+                    "aggs": {"last_activity_time": {"max": {"field": "timestamp"}}},
+                }
+            },
+        }
+        res = cls._run_worker_stats_query(company_id, es_req)
+        buckets = safe_get(res, "aggregations/workers/buckets", default=[])
+        return {
+            b["key"]: {"last_activity_time": b["last_activity_time"]["value"]}
+            for b in buckets
+        }
+
+    @classmethod
+    def _run_worker_stats_query(cls, company_id, es_req) -> dict:
+        return worker_bll.es_client.search(
+            index=f"{WorkerStats.worker_stats_prefix_for_company(company_id)}*",
+            doc_type="stat",
+            body=es_req,
+        )
+
+    @classmethod
+    def _get_experiments_stats(
+        cls, company_id, workers: Optional[Sequence] = None
+    ) -> dict:
+        pipeline = [
+            {
+                "$match": {
+                    "company": company_id,
+                    "started": {"$exists": True, "$ne": None},
+                    "last_update": {"$exists": True, "$ne": None},
+                    "status": {"$nin": ["created", "queued"]},
+                    **({"last_worker": {"$in": workers}} if workers else {}),
+                }
+            },
+            {
+                "$group": {
+                    "_id": "$last_worker" if workers else None,
+                    "count": {"$sum": 1},
+                    "avg_run_time_sec": {
+                        "$avg": {
+                            "$divide": [
+                                {"$subtract": ["$last_update", "$started"]},
+                                1000,
+                            ]
+                        }
+                    },
+                    "avg_iterations": {"$avg": "$last_iteration"},
+                }
+            },
+            {
+                "$project": {
+                    "count": 1,
+                    "avg_run_time_sec": {"$trunc": "$avg_run_time_sec"},
+                    "avg_iterations": {"$trunc": "$avg_iterations"},
+                }
+            },
+        ]
+        return {
+            group["_id"]: {k: v for k, v in group.items() if k != "_id"}
+            for group in Task.aggregate(*pipeline)
+        }
+
+
+class WarningFilter(logging.Filter):
+    @classmethod
+    def attach(cls):
+        from urllib3.connectionpool import (
+            ConnectionPool,
+        )  # required to make sure the logger is created
+
+        assert ConnectionPool  # make sure import is not optimized out
+
+        logging.getLogger("urllib3.connectionpool").addFilter(cls())
+
+    def filter(self, record):
+        if (
+            record.levelno == logging.WARNING
+            and len(record.args) > 2
+            and record.args[2] == "/stats"
+        ):
+            return False
+        return True
--- a/server/bll/task/task_bll.py
+++ b/server/bll/task/task_bll.py
@@ -29,7 +29,7 @@ from .utils import ChangeStatusRequest, validate_status_change


 class TaskBLL(object):
-    threads = ThreadsManager()
+    threads = ThreadsManager("TaskBLL")

    def __init__(self, events_es=None):
        self.events_es = (
--- a/server/bll/util.py
+++ b/server/bll/util.py
@@ -1,7 +1,9 @@
+import functools
 from operator import itemgetter
 from typing import Sequence, Optional, Callable, Tuple, Dict, Any, Set

 from database.model import AttributedDocument
+from database.model.settings import Settings


 def extract_properties_to_lists(
@@ -64,3 +66,8 @@ class SetFieldsResolver:
        in the format suitable for projection (dot separated)
        """
        return set(name.replace("__", ".") for name in self.fields.values())
+
+
+@functools.lru_cache()
+def get_server_uuid() -> Optional[str]:
+    return Settings.get_by_key("server.uuid")
--- a/server/bll/workers/init.py
+++ b/server/bll/workers/init.py
@@ -4,6 +4,7 @@ from typing import Sequence, Set, Optional

 import attr
 import elasticsearch.helpers
+
 import es_factory
 from apierrors import APIError
 from apierrors.errors import bad_request, server_error
@@ -22,10 +23,9 @@ from database.model.auth import User
 from database.model.company import Company
 from database.model.queue import Queue
 from database.model.task.task import Task
-from service_repo.redis_manager import redman
+from redis_manager import redman
 from timing_context import TimingContext
 from tools import safe_get
-
 from .stats import WorkerStats

 log = config.logger(__file__)
@@ -33,9 +33,9 @@ log = config.logger(__file__)

 class WorkerBLL:
    def __init__(self, es=None, redis=None):
-        self.es = es if es is not None else es_factory.connect("workers")
+        self.es_client = es if es is not None else es_factory.connect("workers")
        self.redis = redis if redis is not None else redman.connection("workers")
-        self._stats = WorkerStats(self.es)
+        self._stats = WorkerStats(self.es_client)

    @property
    def stats(self) -> WorkerStats:
@@ -396,7 +396,7 @@ class WorkerBLL:
                    for i, val in enumerate(value)
                )

-        es_res = elasticsearch.helpers.bulk(self.es, actions)
+        es_res = elasticsearch.helpers.bulk(self.es_client, actions)
        added, errors = es_res[:2]
        return (added == len(actions)) and not errors

--- a/server/config/default/apiserver.conf
+++ b/server/config/default/apiserver.conf
@@ -101,4 +101,18 @@
        # GET request timeout
        request_timeout_sec: 3.0
    }
+
+    statistics {
+        # Note: statistics are sent ONLY if the user has actively opted-in
+        supported: true
+
+        url: "https://updates.trains.allegro.ai/stats"
+
+        report_interval_hours: 24
+        agent_relevant_threshold_days: 30
+
+        max_retries: 5
+        max_backoff_sec: 5
+    }
+
 }
--- a/server/config/info.py
+++ b/server/config/info.py
@@ -1,5 +1,6 @@
 from functools import lru_cache
 from pathlib import Path
+from os import getenv

 root = Path(__file__).parent.parent

@@ -26,3 +27,17 @@ def get_commit_number():
        return (root / "COMMIT").read_text().strip()
    except FileNotFoundError:
        return ""
+
+
+@lru_cache()
+def get_deployment_type() -> str:
+    value = getenv("TRAINS_SERVER_DEPLOYMENT_TYPE")
+    if value:
+        return value
+
+    try:
+        value = (root / "DEPLOY").read_text().strip()
+    except FileNotFoundError:
+        pass
+
+    return value or "manual"
--- a/server/database/init.py
+++ b/server/database/init.py
@@ -1,5 +1,6 @@
 from os import getenv

+from boltons.iterutils import first
 from furl import furl
 from jsonmodels import models
 from jsonmodels.errors import ValidationError
@@ -11,14 +12,16 @@ from config import config
 from .defs import Database
 from .utils import get_items

-from boltons.iterutils import first
-
 log = config.logger("database")

 strict = config.get("apiserver.mongo.strict", True)

-OVERRIDE_HOST_ENV_KEY = ("MONGODB_SERVICE_HOST", "MONGODB_SERVICE_SERVICE_HOST")
-OVERRIDE_PORT_ENV_KEY = "MONGODB_SERVICE_PORT"
+OVERRIDE_HOST_ENV_KEY = (
+    "TRAINS_MONGODB_SERVICE_HOST",
+    "MONGODB_SERVICE_HOST",
+    "MONGODB_SERVICE_SERVICE_HOST",
+)
+OVERRIDE_PORT_ENV_KEY = ("TRAINS_MONGODB_SERVICE_PORT", "MONGODB_SERVICE_PORT")

 _entries = []

@@ -41,7 +44,7 @@ def initialize():
    if override_hostname:
        log.info(f"Using override mongodb host {override_hostname}")

-    override_port = getenv(OVERRIDE_PORT_ENV_KEY)
+    override_port = first(map(getenv, OVERRIDE_PORT_ENV_KEY), None)
    if override_port:
        log.info(f"Using override mongodb port {override_port}")

--- a/server/database/model/company.py
+++ b/server/database/model/company.py
@@ -1,23 +1,36 @@
-from mongoengine import Document, EmbeddedDocument, EmbeddedDocumentField, StringField, Q
+from mongoengine import (
+    Document,
+    EmbeddedDocument,
+    EmbeddedDocumentField,
+    StringField,
+    Q,
+    BooleanField,
+    DateTimeField,
+)

 from database import Database, strict
 from database.fields import StrippedStringField
 from database.model import DbModelMixin


+class ReportStatsOption(EmbeddedDocument):
+    enabled = BooleanField(default=False)  # opt-in for statistics reporting
+    enabled_version = StringField()  # server version when enabled
+    enabled_time = DateTimeField()  # time when enabled
+    enabled_user = StringField()  # ID of user who enabled
+
+
 class CompanyDefaults(EmbeddedDocument):
    cluster = StringField()
+    stats_option = EmbeddedDocumentField(ReportStatsOption, default=ReportStatsOption)


 class Company(DbModelMixin, Document):
-    meta = {
-        'db_alias': Database.backend,
-        'strict': strict,
-    }
+    meta = {"db_alias": Database.backend, "strict": strict}

    id = StringField(primary_key=True)
    name = StrippedStringField(unique=True, min_length=3)
-    defaults = EmbeddedDocumentField(CompanyDefaults)
+    defaults = EmbeddedDocumentField(CompanyDefaults, default=CompanyDefaults)

    @classmethod
    def _prepare_perm_query(cls, company, allow_public=False):
--- a/server/database/model/settings.py
+++ b/server/database/model/settings.py
@@ -0,0 +1,57 @@
+from typing import Any, Optional, Sequence, Tuple
+
+from mongoengine import Document, StringField, DynamicField, Q
+from mongoengine.errors import NotUniqueError
+
+from database import Database, strict
+from database.model import DbModelMixin
+
+
+class Settings(DbModelMixin, Document):
+    meta = {
+        "db_alias": Database.backend,
+        "strict": strict,
+    }
+
+    key = StringField(primary_key=True)
+    value = DynamicField()
+
+    @classmethod
+    def get_by_key(cls, key: str, default: Optional[Any] = None, sep: str = ".") -> Any:
+        key = key.strip(sep)
+        res = Settings.objects(key=key).first()
+        if not res:
+            return default
+        return res.value
+
+    @classmethod
+    def get_by_prefix(
+        cls, key_prefix: str, default: Optional[Any] = None, sep: str = "."
+    ) -> Sequence[Tuple[str, Any]]:
+        key_prefix = key_prefix.strip(sep)
+        query = Q(key=key_prefix) | Q(key__startswith=key_prefix + sep)
+        res = Settings.objects(query)
+        if not res:
+            return default
+        return [(x.key, x.value) for x in res]
+
+    @classmethod
+    def set_or_add_value(cls, key: str, value: Any, sep: str = ".") -> bool:
+        """ Sets a new value or adds a new key/value setting (if key does not exist) """
+        key = key.strip(sep)
+        res = Settings.objects(key=key).update(key=key, value=value, upsert=True)
+        # if Settings.objects(key=key).only("key"):
+        #
+        # else:
+        #     res = Settings(key=key, value=value).save()
+        return bool(res)
+
+    @classmethod
+    def add_value(cls, key: str, value: Any, sep: str = ".") -> bool:
+        """ Adds a new key/value settings. Fails if key already exists. """
+        key = key.strip(sep)
+        try:
+            res = Settings(key=key, value=value).save(force_insert=True)
+            return bool(res)
+        except NotUniqueError:
+            return False
--- a/server/es_factory.py
+++ b/server/es_factory.py
@@ -1,20 +1,25 @@
 from datetime import datetime
 from os import getenv

+from boltons.iterutils import first
 from elasticsearch import Elasticsearch, Transport

 from config import config

 log = config.logger(__file__)

-OVERRIDE_HOST_ENV_KEY = ("ELASTIC_SERVICE_HOST", "ELASTIC_SERVICE_SERVICE_HOST")
-OVERRIDE_PORT_ENV_KEY = "ELASTIC_SERVICE_PORT"
+OVERRIDE_HOST_ENV_KEY = (
+    "TRAINS_ELASTIC_SERVICE_HOST",
+    "ELASTIC_SERVICE_HOST",
+    "ELASTIC_SERVICE_SERVICE_HOST",
+)
+OVERRIDE_PORT_ENV_KEY = ("TRAINS_ELASTIC_SERVICE_PORT", "ELASTIC_SERVICE_PORT")

-OVERRIDE_HOST = next(filter(None, map(getenv, OVERRIDE_HOST_ENV_KEY)), None)
+OVERRIDE_HOST = first(filter(None, map(getenv, OVERRIDE_HOST_ENV_KEY)))
 if OVERRIDE_HOST:
    log.info(f"Using override elastic host {OVERRIDE_HOST}")

-OVERRIDE_PORT = getenv(OVERRIDE_PORT_ENV_KEY)
+OVERRIDE_PORT = first(filter(None, map(getenv, OVERRIDE_PORT_ENV_KEY)))
 if OVERRIDE_PORT:
    log.info(f"Using override elastic port {OVERRIDE_PORT}")

@@ -25,6 +30,7 @@ class MissingClusterConfiguration(Exception):
    """
    Exception when cluster configuration is not found in config files
    """
+
    pass


@@ -32,6 +38,7 @@ class InvalidClusterConfiguration(Exception):
    """
    Exception when cluster configuration does not contain required properties
    """
+
    pass


@@ -46,12 +53,14 @@ def connect(cluster_name):
    """
    if cluster_name not in _instances:
        cluster_config = get_cluster_config(cluster_name)
-        hosts = cluster_config.get('hosts', None)
+        hosts = cluster_config.get("hosts", None)
        if not hosts:
            raise InvalidClusterConfiguration(cluster_name)

-        args = cluster_config.get('args', {})
-        _instances[cluster_name] = Elasticsearch(hosts=hosts, transport_class=Transport, **args)
+        args = cluster_config.get("args", {})
+        _instances[cluster_name] = Elasticsearch(
+            hosts=hosts, transport_class=Transport, **args
+        )

    return _instances[cluster_name]

@@ -63,13 +72,13 @@ def get_cluster_config(cluster_name):
    :return: config section for the cluster
    :raises MissingClusterConfiguration: in case no config section is found for the cluster
    """
-    cluster_key = '.'.join(('hosts.elastic', cluster_name))
+    cluster_key = ".".join(("hosts.elastic", cluster_name))
    cluster_config = config.get(cluster_key, None)
    if not cluster_config:
        raise MissingClusterConfiguration(cluster_name)

    def set_host_prop(key, value):
-        for host in cluster_config.get('hosts', []):
+        for host in cluster_config.get("hosts", []):
            host[key] = value

    if OVERRIDE_HOST:
--- a/server/init_data.py
+++ b/server/init_data.py
@@ -1,6 +1,7 @@
 import importlib.util
 from datetime import datetime
 from pathlib import Path
+from uuid import uuid4

 import attr
 from furl import furl
@@ -15,6 +16,7 @@ from database.model.auth import Role
 from database.model.auth import User as AuthUser, Credentials
 from database.model.company import Company
 from database.model.queue import Queue
+from database.model.settings import Settings
 from database.model.user import User
 from database.model.version import Version as DatabaseVersion
 from elastic.apply_mappings import apply_mappings_to_host
@@ -109,10 +111,7 @@ def _ensure_user(user: FixedUser, company_id: str):
    data["email"] = f"{user.user_id}@example.com"
    data["role"] = Role.user

-    _ensure_auth_user(
-        user_data=data,
-        company_id=company_id,
-    )
+    _ensure_auth_user(user_data=data, company_id=company_id)

    given_name, _, family_name = user.name.partition(" ")

@@ -142,9 +141,7 @@ def _apply_migrations():
    try:
        new_scripts = {
            ver: path
-            for ver, path in (
-                (Version(f.stem), f) for f in migration_dir.glob("*.py")
-            )
+            for ver, path in ((Version(f.stem), f) for f in migration_dir.glob("*.py"))
            if ver > last_version
        }
    except ValueError as ex:
@@ -179,16 +176,30 @@ def _apply_migrations():
        ).save()


+def _ensure_uuid():
+    Settings.add_value("server.uuid", str(uuid4()))
+
+
 def init_mongo_data():
    try:
        _apply_migrations()

+        _ensure_uuid()
+
        company_id = _ensure_company()
        _ensure_default_queue(company_id)

        users = [
-            {"name": "apiserver", "role": Role.system, "email": "apiserver@example.com"},
-            {"name": "webserver", "role": Role.system, "email": "webserver@example.com"},
+            {
+                "name": "apiserver",
+                "role": Role.system,
+                "email": "apiserver@example.com",
+            },
+            {
+                "name": "webserver",
+                "role": Role.system,
+                "email": "webserver@example.com",
+            },
            {"name": "tests", "role": Role.user, "email": "tests@example.com"},
        ]

--- a/server/service_repo/redis_manager.py
+++ b/server/service_repo/redis_manager.py
@@ -2,21 +2,23 @@ import threading
 from os import getenv
 from time import sleep

-from apierrors.errors.server_error import ConfigError, GeneralError
-from config import config
+from boltons.iterutils import first
 from redis import StrictRedis
 from redis.sentinel import Sentinel, SentinelConnectionPool

+from apierrors.errors.server_error import ConfigError, GeneralError
+from config import config
+
 log = config.logger(__file__)

-OVERRIDE_HOST_ENV_KEY = "REDIS_SERVICE_HOST"
-OVERRIDE_PORT_ENV_KEY = "REDIS_SERVICE_PORT"
+OVERRIDE_HOST_ENV_KEY = ("TRAINS_REDIS_SERVICE_HOST", "REDIS_SERVICE_HOST")
+OVERRIDE_PORT_ENV_KEY = ("TRAINS_REDIS_SERVICE_PORT", "REDIS_SERVICE_PORT")

-OVERRIDE_HOST = getenv(OVERRIDE_HOST_ENV_KEY)
+OVERRIDE_HOST = first(filter(None, map(getenv, OVERRIDE_HOST_ENV_KEY)))
 if OVERRIDE_HOST:
    log.info(f"Using override redis host {OVERRIDE_HOST}")

-OVERRIDE_PORT = getenv(OVERRIDE_PORT_ENV_KEY)
+OVERRIDE_PORT = first(filter(None, map(getenv, OVERRIDE_PORT_ENV_KEY)))
 if OVERRIDE_PORT:
    log.info(f"Using override redis port {OVERRIDE_PORT}")

--- a/server/requirements.txt
+++ b/server/requirements.txt
@@ -28,3 +28,4 @@ semantic_version>=2.6.0,<3
 furl>=2.0.0
 redis>=2.10.5
 humanfriendly==4.18
+psutil>=5.6.5
--- a/server/schema/services/server.conf
+++ b/server/schema/services/server.conf
@@ -3,6 +3,25 @@ _default {
    internal: true
    allow_roles: ["root", "system"]
 }
+get_stats {
+    "2.1" {
+        description: "Get the server collected statistics."
+        request {
+            type: object
+            properties {
+                interval {
+                    description: "The period for statistics collection in seconds."
+                    type: long
+                }
+            }
+        }
+        response {
+            type: object
+            properties: {
+            }
+        }
+    }
+}
 config {
    "2.1" {
        description: "Get server configuration. Secure section is not returned."
@@ -66,3 +85,39 @@ endpoints {
        }
    }
 }
+report_stats_option {
+    "2.4" {
+        description: "Get or set the report statistics option per-company"
+        request {
+            type: object
+            properties {
+                enabled {
+                    description: "If provided, sets the report statistics option (true/false)"
+                    type: boolean
+                }
+            }
+        }
+        response {
+            type: object
+            properties {
+                enabled {
+                    description: "Returns the current report stats option value"
+                    type: boolean
+                }
+                enabled_time {
+                    description: "If enabled, returns the time at which option was enabled"
+                    type: string
+                    format: date-time
+                }
+                enabled_version {
+                    description: "If enabled, returns the server version at the time option was enabled"
+                    type: string
+                }
+                enabled_user {
+                    description: "If enabled, returns Id of the user who enabled the option"
+                    type: string
+                }
+            }
+        }
+    }
+}
--- a/server/server.py
+++ b/server/server.py
@@ -7,15 +7,15 @@ from werkzeug.exceptions import BadRequest

 import database
 from apierrors.base import BaseError
+from bll.statistics.stats_reporter import StatisticsReporter
 from config import config
+from init_data import init_es_data, init_mongo_data
 from service_repo import ServiceRepo, APICall
 from service_repo.auth import AuthType
 from service_repo.errors import PathParsingError
 from timing_context import TimingContext
-from utilities import json
-from init_data import init_es_data, init_mongo_data
 from updates import check_updates_thread
-
+from utilities import json

 app = Flask(__name__, static_url_path="/static")
 CORS(app, **config.get("apiserver.cors"))
@@ -38,6 +38,7 @@ log.info(f"Exposed Services: {' '.join(ServiceRepo.endpoint_names())}")


 check_updates_thread.start()
+StatisticsReporter.start()


@app.before_first_request
@@ -57,7 +58,9 @@ def before_request():
        content, content_type = ServiceRepo.handle_call(call)
        headers = {}
        if call.result.filename:
-            headers["Content-Disposition"] = f"attachment; filename={call.result.filename}"
+            headers[
+                "Content-Disposition"
+            ] = f"attachment; filename={call.result.filename}"

        if call.result.headers:
            headers.update(call.result.headers)
@@ -71,7 +74,9 @@ def before_request():
                if value is None:
                    response.set_cookie(key, "", expires=0)
                else:
-                    response.set_cookie(key, value, **config.get("apiserver.auth.cookies"))
+                    response.set_cookie(
+                        key, value, **config.get("apiserver.auth.cookies")
+                    )

        return response
    except Exception as ex:
--- a/server/service_repo/apicall.py
+++ b/server/service_repo/apicall.py
@@ -255,6 +255,7 @@ class MissingIdentity(Exception):
 class APICall(DataContainer):
    HEADER_AUTHORIZATION = "Authorization"
    HEADER_REAL_IP = "X-Real-IP"
+    HEADER_FORWARDED_FOR = "X-Forwarded-For"
    """ Standard headers """

    _transaction_headers = ("X-Trains-Trx",)
@@ -382,8 +383,13 @@ class APICall(DataContainer):

    @property
    def real_ip(self):
-        real_ip = self.get_header(self.HEADER_REAL_IP)
-        return real_ip or self._remote_addr or "untrackable"
+        """ Obtain visitor's IP address """
+        return (
+                self.get_header(self.HEADER_FORWARDED_FOR)
+                or self.get_header(self.HEADER_REAL_IP)
+                or self._remote_addr
+                or "untrackable"
+        )

    @property
    def failed(self):
--- a/server/services/events.py
+++ b/server/services/events.py
@@ -394,7 +394,7 @@ def get_task_plots_v1_7(call, company_id, req_model):

    task_bll.assert_exists(call.identity.company, task_id, allow_public=True)
    # events, next_scroll_id, total_events = event_bll.get_task_events(
-    #     company_id, task_id,
+    #     company, task_id,
    #     event_type="plot",
    #     sort=[{"iter": {"order": "desc"}}],
    #     last_iter_count=iters,
@@ -453,7 +453,7 @@ def get_debug_images_v1_7(call, company_id, req_model):

    task_bll.assert_exists(call.identity.company, task_id, allow_public=True)
    # events, next_scroll_id, total_events = event_bll.get_task_events(
-    #     company_id, task_id,
+    #     company, task_id,
    #     event_type="training_debug_image",
    #     sort=[{"iter": {"order": "desc"}}],
    #     last_iter_count=iters,
--- a/server/services/server/init.py
+++ b/server/services/server/init.py
@@ -1,8 +1,24 @@
+from datetime import datetime
+
 from pyhocon.config_tree import NoneValue

+from apierrors import errors
+from apimodels.server import ReportStatsOptionRequest, ReportStatsOptionResponse
+from bll.statistics.stats_reporter import StatisticsReporter
 from config import config
 from config.info import get_version, get_build_number, get_commit_number
+from database.errors import translate_errors_context
+from database.model import Company
+from database.model.company import ReportStatsOption
 from service_repo import ServiceRepo, APICall, endpoint
+from version import __version__ as current_version
+
+
+@endpoint("server.get_stats")
+def get_stats(call: APICall):
+    call.result.data = StatisticsReporter.get_statistics(
+        company_id=call.identity.company
+    )


@endpoint("server.config")
@@ -43,3 +59,35 @@ def info(call: APICall):
        "build": get_build_number(),
        "commit": get_commit_number(),
    }
+
+
+@endpoint(
+    "server.report_stats_option",
+    request_data_model=ReportStatsOptionRequest,
+    response_data_model=ReportStatsOptionResponse,
+)
+def report_stats(call: APICall, company: str, request: ReportStatsOptionRequest):
+    if not StatisticsReporter.supported:
+        result = ReportStatsOptionResponse(supported=False)
+    else:
+        enabled = request.enabled
+        with translate_errors_context():
+            query = Company.objects(id=company)
+            if enabled is None:
+                stats_option = query.first().defaults.stats_option
+            else:
+                stats_option = ReportStatsOption(
+                    enabled=enabled,
+                    enabled_time=datetime.utcnow(),
+                    enabled_version=current_version,
+                    enabled_user=call.identity.user,
+                )
+                updated = query.update(defaults__stats_option=stats_option)
+                if not updated:
+                    raise errors.server_error.InternalError(
+                        f"Failed setting report_stats to {enabled}"
+                    )
+
+        result = ReportStatsOptionResponse(**stats_option.to_mongo())
+
+    call.result.data_model = result
--- a/server/tests/automated/test_workers.py
+++ b/server/tests/automated/test_workers.py
@@ -108,7 +108,7 @@ class TestWorkersService(TestService):
        from_date = to_date - timedelta(days=1)

        # no variants
-        res = self.api.workers.get_stats(
+        res = self.api.workers.get_statistics(
            items=[
                dict(key="cpu_usage", aggregation="avg"),
                dict(key="cpu_usage", aggregation="max"),
@@ -142,7 +142,7 @@ class TestWorkersService(TestService):
        )

        # split by variants
-        res = self.api.workers.get_stats(
+        res = self.api.workers.get_statistics(
            items=[dict(key="cpu_usage", aggregation="avg")],
            from_date=from_date.timestamp(),
            to_date=to_date.timestamp(),
@@ -165,7 +165,7 @@ class TestWorkersService(TestService):

        assert all(_check_metric_and_variants(worker) for worker in res["workers"])

-        res = self.api.workers.get_stats(
+        res = self.api.workers.get_statistics(
            items=[dict(key="cpu_usage", aggregation="avg")],
            from_date=from_date.timestamp(),
            to_date=to_date.timestamp(),
@@ -183,12 +183,13 @@ class TestWorkersService(TestService):
        # run on an empty es db since we have no way
        # to pass non existing workers to this api
        # res = self.api.workers.get_activity_report(
-        #     from_date=from_date.timestamp(),
-        #     to_date=to_date.timestamp(),
+        #     from_timestamp=from_timestamp.timestamp(),
+        #     to_timestamp=to_timestamp.timestamp(),
        #     interval=20,
        # )

        self._simulate_workers()
+
        to_date = utc_now_tz_aware()
        from_date = to_date - timedelta(minutes=10)

--- a/server/updates.py
+++ b/server/updates.py
@@ -8,6 +8,7 @@ import requests
 from semantic_version import Version

 from config import config
+from database.model.settings import Settings
 from version import __version__ as current_version

 log = config.logger(__name__)
@@ -39,12 +40,15 @@ class CheckUpdatesThread(Thread):

    def _check_new_version_available(self) -> Optional[_VersionResponse]:
        url = config.get(
-            "apiserver.check_for_updates.url", "https://updates.trains.allegro.ai/updates"
+            "apiserver.check_for_updates.url",
+            "https://updates.trains.allegro.ai/updates",
        )

+        uid = Settings.get_by_key("server.uuid")
+
        response = requests.get(
            url,
-            json={"versions": {self.component_name: str(current_version)}},
+            json={"versions": {self.component_name: str(current_version)}, "uid": uid},
            timeout=float(
                config.get("apiserver.check_for_updates.request_timeout_sec", 3.0)
            ),
--- a/server/utilities/threads_manager.py
+++ b/server/utilities/threads_manager.py
@@ -1,14 +1,29 @@
 from functools import wraps
 from threading import Lock, Thread

-import attr

-
-@attr.s(auto_attribs=True)
 class ThreadsManager:
    objects = {}
    lock = Lock()

+    def __init__(self, name=None, **threads):
+        super(ThreadsManager, self).__init__()
+        self.name = name or self.__class__.name
+        self.objects = {}
+        self.lock = Lock()
+
+        for name, thread in threads.items():
+            if issubclass(thread, Thread):
+                thread = thread()
+                thread.start()
+            elif isinstance(thread, Thread):
+                if not thread.is_alive():
+                    thread.start()
+            else:
+                raise Exception(f"Expected thread or thread class ({name}): {thread}")
+
+            self.objects[name] = thread
+
    def register(self, thread_name, daemon=True):
        def decorator(f):
            @wraps(f)
@@ -17,7 +32,7 @@ class ThreadsManager:
                    thread = self.objects.get(thread_name)
                    if not thread:
                        thread = Thread(
-                            target=f, name=thread_name, args=args, kwargs=kwargs
+                            target=f, name=f"{self.name}_{thread_name}", args=args, kwargs=kwargs
                        )
                        thread.daemon = daemon
                        thread.start()
@@ -27,3 +42,13 @@ class ThreadsManager:
            return wrapper

        return decorator
+
+    def __getattr__(self, item):
+        if item in self.objects:
+            return self.objects[item]
+        return self.__getattribute__(item)
+
+    def __getitem__(self, item):
+        if item in self.objects:
+            return self.objects[item]
+        raise KeyError(item)
Author	SHA1	Message	Date
allegroai	2c3f0e4ba3	Update AWS images 0.12.1	2019-12-14 23:46:21 +02:00
allegroai	c48eb34d8d	Add resource monitoring	2019-12-14 23:35:42 +02:00
allegroai	49515e06e1	Optimize thread processing	2019-12-14 23:35:18 +02:00
allegroai	4a1d97c02f	typo	2019-12-14 23:34:00 +02:00
allegroai	6c6c1c3f41	Add server resource monitoring	2019-12-14 23:33:36 +02:00
allegroai	0ad687008c	Improve server update checks	2019-12-14 23:33:04 +02:00
Allegro AI	fe3dbc92dc	Update README.md	2019-11-19 00:14:45 +02:00
Allegro AI	dc53970ff0	Update README.md	2019-11-19 00:01:12 +02:00
Allegro AI	73592b991b	Update README.md	2019-11-16 00:10:19 +02:00
Allegro AI	47b981a993	Update README.md	2019-11-16 00:08:36 +02:00
Allegro AI	b500bcab0b	Update faq.md	2019-11-16 00:07:30 +02:00
allegroai	59e910db1a	Add docker-compose Windows support	2019-11-16 00:04:04 +02:00
allegroai	2ecb430f02	Documentation	2019-11-10 00:23:45 +02:00
Allegro AI	a08722e394	Update README.md	2019-11-10 00:18:16 +02:00
Allegro AI	67c210d9d7	Update README.md	2019-11-10 00:14:30 +02:00
Allegro AI	101ba540f4	Update README.md	2019-11-10 00:08:52 +02:00
Allegro AI	82fc28d477	Update README.md	2019-11-10 00:06:12 +02:00
Allegro AI	7b73f699d2	Update README.md	2019-11-10 00:05:21 +02:00
allegroai	a7e5380f67	Add configuration example, experiments watchdog	2019-11-10 00:03:57 +02:00
allegroai	bcade31786	Add configuration example, limit user login	2019-11-09 23:59:08 +02:00
Allegro AI	6b902f85f4	Update README.md	2019-11-09 23:54:59 +02:00
allegroai	6d4c974045	Documentation	2019-11-09 23:45:12 +02:00
allegroai	2346c6f3f5	Documentation	2019-11-09 23:19:21 +02:00
Allegro AI	82e51b4d36	Update README.md	2019-11-09 23:07:43 +02:00
allegroai	e63599254e	Documentation	2019-11-09 21:32:30 +02:00
allegroai	8e7e234161	Add finer control for mongo/elastic/redis host configuration	2019-11-09 21:29:23 +02:00
allegroai	17d94b26c3	Documentation	2019-11-06 12:25:39 +02:00