Compare commits

27 Commits

Author SHA1 Message Date
allegroai
2c3f0e4ba3 Update AWS images 0.12.1 2019-12-14 23:46:21 +02:00
allegroai
c48eb34d8d Add resource monitoring 2019-12-14 23:35:42 +02:00
allegroai
49515e06e1 Optimize thread processing 2019-12-14 23:35:18 +02:00
allegroai
4a1d97c02f typo 2019-12-14 23:34:00 +02:00
allegroai
6c6c1c3f41 Add server resource monitoring 2019-12-14 23:33:36 +02:00
allegroai
0ad687008c Improve server update checks 2019-12-14 23:33:04 +02:00
Allegro AI
fe3dbc92dc Update README.md 2019-11-19 00:14:45 +02:00
Allegro AI
dc53970ff0 Update README.md 2019-11-19 00:01:12 +02:00
Allegro AI
73592b991b Update README.md 2019-11-16 00:10:19 +02:00
Allegro AI
47b981a993 Update README.md 2019-11-16 00:08:36 +02:00
Allegro AI
b500bcab0b Update faq.md 2019-11-16 00:07:30 +02:00
allegroai
59e910db1a Add docker-compose Windows support 2019-11-16 00:04:04 +02:00
allegroai
2ecb430f02 Documentation 2019-11-10 00:23:45 +02:00
Allegro AI
a08722e394 Update README.md 2019-11-10 00:18:16 +02:00
Allegro AI
67c210d9d7 Update README.md 2019-11-10 00:14:30 +02:00
Allegro AI
101ba540f4 Update README.md 2019-11-10 00:08:52 +02:00
Allegro AI
82fc28d477 Update README.md 2019-11-10 00:06:12 +02:00
Allegro AI
7b73f699d2 Update README.md 2019-11-10 00:05:21 +02:00
allegroai
a7e5380f67 Add configuration example, experiments watchdog 2019-11-10 00:03:57 +02:00
allegroai
bcade31786 Add configuration example, limit user login 2019-11-09 23:59:08 +02:00
Allegro AI
6b902f85f4 Update README.md 2019-11-09 23:54:59 +02:00
allegroai
6d4c974045 Documentation 2019-11-09 23:45:12 +02:00
allegroai
2346c6f3f5 Documentation 2019-11-09 23:19:21 +02:00
Allegro AI
82e51b4d36 Update README.md 2019-11-09 23:07:43 +02:00
allegroai
e63599254e Documentation 2019-11-09 21:32:30 +02:00
allegroai
8e7e234161 Add finer control for mongo/elastic/redis host configuration 2019-11-09 21:29:23 +02:00
allegroai
17d94b26c3 Documentation 2019-11-06 12:25:39 +02:00
31 changed files with 1140 additions and 241 deletions

228
README.md
View File

@@ -51,70 +51,84 @@ Use one of our pre-installed Amazon Machine Images for easy deployment in AWS.
For details and instructions, see [TRAINS-server: AWS pre-installed images](docs/install_aws.md).
## Docker Installation - Linux, Mac OS X <a name="installation"></a>
## Docker Installation - Linux, macOS, and Windows <a name="installation"></a>
Use our pre-built Docker image for easy deployment in Linux and Mac OS X.
For Windows, we recommend installing our pre-built Docker image on a Linux virtual machine.
Use our pre-built Docker image for easy deployment in Linux and macOS. <br>
For [Windows](https://github.com/allegroai/trains-server/blob/master/docs/faq.md#docker_compose_win10), please see detailed docker-compose installation instructions on our [FAQ](https://github.com/allegroai/trains-server/blob/master/docs/faq.md#docker_compose_win10).<br>
Latest docker images can be found [here](https://hub.docker.com/r/allegroai/trains).
1. Setup Docker ([docker-compose Ubuntu](docs/faq.md#ubuntu), [docker-compose OS X](docs/faq.md#mac-osx), [Setup Docker Service Manually](docs/docker_setup.md#setup-docker))
1. Setup Docker (docker-compose installation details: [Ubuntu](docs/faq.md#ubuntu) / [macOS](docs/faq.md#mac-osx))
Make sure port 8080/8081/8008 are available for the `trains-server` services
Increase vm.max_map_count for `ElasticSearch` docker
<details>
<summary>Make sure ports 8080/8081/8008 are available for the TRAINS-server services:</summary>
For example, to see if port `8080` is in use:
```bash
echo "vm.max_map_count=262144" > /tmp/99-trains.conf
sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf
sudo sysctl -w vm.max_map_count=262144
sudo service docker restart
$ sudo lsof -Pn -i4 | grep :8080 | grep LISTEN
```
</details>
Increase vm.max_map_count for `ElasticSearch` docker
- Linux
```bash
$ echo "vm.max_map_count=262144" > /tmp/99-trains.conf
$ sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf
$ sudo sysctl -w vm.max_map_count=262144
$ sudo service docker restart
```
- macOS
```bash
$ screen ~/Library/Containers/com.docker.docker/Data/vms/0/tty
$ sysctl -w vm.max_map_count=262144
```
1. Create local directories for the databases and storage.
```bash
sudo mkdir -p /opt/trains/data/elastic
sudo mkdir -p /opt/trains/data/mongo/db
sudo mkdir -p /opt/trains/data/mongo/configdb
sudo mkdir -p /opt/trains/data/redis
sudo mkdir -p /opt/trains/logs
sudo mkdir -p /opt/trains/data/fileserver
sudo mkdir -p /opt/trains/config
$ sudo mkdir -p /opt/trains/data/elastic
$ sudo mkdir -p /opt/trains/data/mongo/db
$ sudo mkdir -p /opt/trains/data/mongo/configdb
$ sudo mkdir -p /opt/trains/data/redis
$ sudo mkdir -p /opt/trains/logs
$ sudo mkdir -p /opt/trains/data/fileserver
$ sudo mkdir -p /opt/trains/config
```
Linux
```bash
$ sudo chown -R 1000:1000 /opt/trains
```
Mac OS X
```bash
$ sudo chown -R $(whoami):staff /opt/trains
```
Set folder permissions
- Linux
```bash
$ sudo chown -R 1000:1000 /opt/trains
```
- macOS
```bash
$ sudo chown -R $(whoami):staff /opt/trains
```
1. Clone the [trains-server](https://github.com/allegroai/trains-server) repository and change directories to the new **trains-server** directory.
1. Download the `docker-compose.yml` file, either download [manually](https://raw.githubusercontent.com/allegroai/trains-server/master/docker-compose.yml) or execute:
```bash
$ git clone https://github.com/allegroai/trains-server.git
$ cd trains-server
$ curl https://raw.githubusercontent.com/allegroai/trains-server/master/docker-compose.yml -o docker-compose.yml
```
1. Launch the Docker containers <a name="launch-docker"></a>
* Automatically with docker-compose (details: [Linux/Ubuntu](docs/faq.md#ubuntu), [OS X](docs/faq.md#mac-osx))
```bash
$ docker-compose up
$ docker-compose -f docker-compose.yml up
```
* Manually, see [Launching Docker Containers Manually](docs/docker_setup.md#launch) for instructions.
1. Your server is now running on [http://localhost:8080](http://localhost:8080) and the following ports are available:
* Web server on port `8080`
* API server on port `8008`
* File server on port `8081`
**\* If something went wrong along the way, check our FAQ: [Docker Setup](docs/docker_setup.md#setup-docker), [Ubuntu Support](docs/faq.md#ubuntu), [macOS Support](docs/faq.md#mac-osx)**
## Optional Configuration
The **trains-server** default configuration can be easily overridden using external configuration files. By default, the server will look for these files in `/opt/trains/config`.
@@ -128,30 +142,9 @@ You can configure the **trains-server** to allow only a specific set of users to
Enable this feature by placing `apiserver.conf` file under `/opt/trains/config`.
Sample `apiserver.conf` configuration file can be found [here](https://github.com/allegroai/trains-server/blob/master/docs/apiserver.conf)
Sample fixed user configuration file `/opt/trains/config/apiserver.conf`:
auth {
# Fixed users login credetials
# No other user will be able to login
fixed_users {
enabled: true
users: [
{
username: "jane"
password: "12345678"
name: "Jane Doe"
},
{
username: "john"
password: "12345678"
name: "John Doe"
},
]
}
}
To apply the `apiserver.conf` changes, you must restart the *trains-apiserver* (docker) (see [Restarting trains-server](#restart-server)).
To apply the changes, you must [restart the *trains-server*](#restart-server).
### Configuring the Non-Responsive Experiments Watchdog
@@ -160,30 +153,18 @@ and marks them as `aborted`. The watchdog is always active with a default of 720
To change the watchdog's timeouts, place a `services.conf` file under `/opt/trains/config`.
Sample watchdog configuration file `/opt/trains/config/services.conf`:
Sample watchdog `services.conf` configuration file can be found [here](https://github.com/allegroai/trains-server/blob/master/docs/services.conf)
tasks {
non_responsive_tasks_watchdog {
# In-progress tasks that haven't been updated for at least 'value' seconds will be stopped by the watchdog
threshold_sec: 7200
# Watchdog will sleep for this number of seconds after each cycle
watch_interval_sec: 900
}
}
To apply the `services.conf` changes, you must restart the *trains-apiserver* (docker) (see [Restarting trains-server](#restart-server)).
To apply the changes, you must [restart the *trains-server*](#restart-server).
### Restarting trains-server <a name="restart-server"></a>
To restart the **trains-server**, you must first stop and remove the containers, and then restart.
To restart the **trains-server**, you must first stop the containers, and then restart them.
```bash
$ docker-compose down
$ docker-compose -f docker-compose.yml up
```
1. Restarting docker-compose containers.
$ docker-compose down
$ docker-compose up
1. Manually restarting dockers [instructions](docs/docker_setup.md#launch).
## Configuring **TRAINS** client
@@ -222,71 +203,42 @@ We are constantly updating, improving and adding to the **trains-server**.
New releases will include new pre-built Docker images.
When we release a new version and include a new pre-built Docker image for it, upgrade as follows:
* Upgrading your docker-compose installation
1. Shut down the docker containers
```bash
$ docker-compose down
```
* Shut down the docker containers
```bash
$ docker-compose down
```
* We highly recommend backing up your data directory before upgrading
(see **Step ii** in the Manual Docker upgrade)
1. We highly recommend backing up your data directory before upgrading.
* Spin up the docker containers, it will automatically pull the latest trains-server build
```bash
$ docker-compose up
```
Assuming your data directory is `/opt/trains`, to archive all data into `~/trains_backup.tgz` execute:
* In case of a docker error: "... The container name "/trains-???" is already in use by ..."
Try removing deprecated images with:
```bash
$ docker rm -f $(docker ps -a -q)
```
```bash
$ sudo tar czvf ~/trains_backup.tgz /opt/trains/data
```
* Manual Docker upgrade
1. Shut down and remove each of your Docker instances using the following commands:
```bash
$ sudo docker stop <docker-name>
$ sudo docker rm -v <docker-name>
```
The Docker names are (see [Launching Docker Containers](#launch-docker)):
* `trains-elastic`
* `trains-mongo`
* `trains-redis`
* `trains-fileserver`
* `trains-apiserver`
* `trains-webserver`
2. We highly recommend backing up your data directory!. A simple way to do that is using `tar`:
For example, if your data directory is `/opt/trains`, use the following command:
```bash
$ sudo tar czvf ~/trains_backup.tgz /opt/trains/data
```
This backups all data to an archive in your home directory.
To restore this example backup, use the following command:
```bash
$ sudo rm -R /opt/trains/data
$ sudo tar -xzf ~/trains_backup.tgz -C /opt/trains/data
```
3. Pull the new **trains-server** docker image using the following command:
```bash
$ sudo docker pull allegroai/trains:latest
```
If you wish to pull a different version, replace `latest` with the required version number, for example:
```bash
$ sudo docker pull allegroai/trains:0.11.0
```
4. Launch the newly released Docker image (see [Launching Docker Containers](#launch-docker)).
<details>
<summary>Restore instructions:</summary>
To restore this example backup, execute:
```bash
$ sudo rm -R /opt/trains/data
$ sudo tar -xzf ~/trains_backup.tgz -C /opt/trains/data
```
</details>
1. Download the latest `docker-compose.yml` file, either [manually](https://raw.githubusercontent.com/allegroai/trains-server/master/docker-compose.yml) or execute:
```bash
$ curl https://raw.githubusercontent.com/allegroai/trains-server/master/docker-compose.yml -o docker-compose.yml
```
1. Spin up the docker containers, it will automatically pull the latest trains-server build
```bash
$ docker-compose -f docker-compose.yml pull
$ docker-compose -f docker-compose.yml up
```
**\* If something went wrong along the way, check our FAQ: [Docker Upgrade](docs/docker_setup.md#common-docker-upgrade-errors)**
## Community & Support

117
docker-compose-win10.yml Normal file
View File

@@ -0,0 +1,117 @@
version: "3.6"
services:
apiserver:
command:
- apiserver
container_name: trains-apiserver
image: allegroai/trains:latest
restart: unless-stopped
volumes:
- c:/opt/trains/logs:/var/log/trains
- c:/opt/trains/config:/opt/trains/config
depends_on:
- redis
- mongo
- elasticsearch
- fileserver
environment:
ELASTIC_SERVICE_HOST: elasticsearch
MONGODB_SERVICE_HOST: mongo
REDIS_SERVICE_HOST: redis
ports:
- "8008:8008"
networks:
- backend
elasticsearch:
networks:
- backend
container_name: trains-elastic
environment:
ES_JAVA_OPTS: -Xms2g -Xmx2g
bootstrap.memory_lock: "true"
cluster.name: trains
cluster.routing.allocation.node_initial_primaries_recoveries: "500"
discovery.zen.minimum_master_nodes: "1"
http.compression_level: "7"
node.ingest: "true"
node.name: trains
reindex.remote.whitelist: '*.*'
script.inline: "true"
script.painless.regex.enabled: "true"
script.update: "true"
thread_pool.bulk.queue_size: "2000"
thread_pool.search.queue_size: "10000"
xpack.monitoring.enabled: "false"
xpack.security.enabled: "false"
ulimits:
memlock:
soft: -1
hard: -1
nofile:
soft: 65536
hard: 65536
image: docker.elastic.co/elasticsearch/elasticsearch:5.6.16
restart: unless-stopped
volumes:
- c:/opt/trains/data/elastic:/usr/share/elasticsearch/data
ports:
- "9200:9200"
fileserver:
networks:
- backend
command:
- fileserver
container_name: trains-fileserver
image: allegroai/trains:latest
restart: unless-stopped
volumes:
- c:/opt/trains/logs:/var/log/trains
- c:/opt/trains/data/fileserver:/mnt/fileserver
ports:
- "8081:8081"
mongo:
networks:
- backend
container_name: trains-mongo
image: mongo:3.6.5
restart: unless-stopped
command: --setParameter internalQueryExecMaxBlockingSortBytes=196100200
volumes:
- mongodata:/data
ports:
- "27017:27017"
redis:
networks:
- backend
container_name: trains-redis
image: redis:5.0
restart: unless-stopped
volumes:
- c:/opt/trains/data/redis:/data
ports:
- "6379:6379"
webserver:
command:
- webserver
container_name: trains-webserver
image: allegroai/trains:latest
restart: unless-stopped
volumes:
- c:/trains/logs:/var/log/trains
depends_on:
- apiserver
ports:
- "8080:80"
networks:
backend:
driver: bridge
volumes:
mongodata:

19
docs/apiserver.conf Normal file
View File

@@ -0,0 +1,19 @@
auth {
# Fixed users login credetials
# No other user will be able to login
fixed_users {
enabled: true
users: [
{
username: "jane"
password: "12345678"
name: "Jane Doe"
},
{
username: "john"
password: "12345678"
name: "John Doe"
},
]
}
}

View File

@@ -58,15 +58,15 @@ Create this directory, and set its owner and group to `uid` 1000. The data store
For example, if your data directory is `/opt/trains`, then use the following command:
```bash
sudo mkdir -p /opt/trains/data/elastic
sudo mkdir -p /opt/trains/data/mongo/db
sudo mkdir -p /opt/trains/data/mongo/configdb
sudo mkdir -p /opt/trains/data/redis
sudo mkdir -p /opt/trains/logs
sudo mkdir -p /opt/trains/data/fileserver
sudo mkdir -p /opt/trains/config
sudo mkdir -p /opt/trains/data/elastic
sudo mkdir -p /opt/trains/data/mongo/db
sudo mkdir -p /opt/trains/data/mongo/configdb
sudo mkdir -p /opt/trains/data/redis
sudo mkdir -p /opt/trains/logs
sudo mkdir -p /opt/trains/data/fileserver
sudo mkdir -p /opt/trains/config
sudo chown -R 1000:1000 /opt/trains
sudo chown -R 1000:1000 /opt/trains
```
## TRAINS-server: Manually Launching Docker Containers <a name="launch"></a>
@@ -104,3 +104,63 @@ If your data directory is not `/opt/trains`, then in the five `docker run` comma
* API server on port `8008`
* Web server on port `8080`
* File server on port `8081`
## Manually Upgrading TRAINS-server Containers <a name="upgrade"></a>
We are constantly updating, improving and adding to the **trains-server**.
New releases will include new pre-built Docker images.
When we release a new version and include a new pre-built Docker image for it, upgrade as follows:
1. Shut down and remove each of your Docker instances using the following commands:
```bash
$ sudo docker stop <docker-name>
$ sudo docker rm -v <docker-name>
```
The Docker names are (see [Launching Docker Containers](#launch-docker)):
* `trains-elastic`
* `trains-mongo`
* `trains-redis`
* `trains-fileserver`
* `trains-apiserver`
* `trains-webserver`
2. We highly recommend backing up your data directory!. A simple way to do that is using `tar`:
For example, if your data directory is `/opt/trains`, use the following command:
```bash
$ sudo tar czvf ~/trains_backup.tgz /opt/trains/data
```
This backups all data to an archive in your home directory.
To restore this example backup, use the following command:
```bash
$ sudo rm -R /opt/trains/data
$ sudo tar -xzf ~/trains_backup.tgz -C /opt/trains/data
```
3. Pull the new **trains-server** docker image using the following command:
```bash
$ sudo docker pull allegroai/trains:latest
```
If you wish to pull a different version, replace `latest` with the required version number, for example:
```bash
$ sudo docker pull allegroai/trains:0.11.0
```
4. Launch the newly released Docker image (see [Launching Docker Containers](#trains-server-manually-launching-docker-containers-)).
#### Common Docker Upgrade Errors
* In case of a docker error: "... The container name "/trains-???" is already in use by ..."
Try removing deprecated images with:
```bash
$ docker rm -f $(docker ps -a -q)
```

View File

@@ -6,12 +6,15 @@
* [Running trains-server on Mac OS X](#mac-osx)
* [Running trains-server on Windows 10](#docker_compose_win10)
* [Installing trains-server on stand alone Linux Ubuntu systems ](#ubuntu)
* [Resolving port conflicts preventing fixed users mode authentication and login](#port-conflict)
* [Configuring trains-server for sub-domains and load balancers](#sub-domains)
### Deploying trains-server on Kubernetes clusters <a name="kubernetes"></a>
**trains-server** supports Kubernetes. See [trains-server-k8s](https://github.com/allegroai/trains-server-k8s)
@@ -33,13 +36,14 @@ To install and configure **trains-server** on Mac OS X, follow the steps below.
1. Configure [Docker](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#docker-cli-run-prod-mode).
$ screen ~/Library/Containers/com.docker.docker/Data/vms/0/tty
sysctl -w vm.max_map_count=262144
$ sysctl -w vm.max_map_count=262144
1. Create local directories for the databases and storage.
$ sudo mkdir -p /opt/trains/data/elastic
$ sudo mkdir -p /opt/trains/data/mongo/db
$ sudo mkdir -p /opt/trains/data/mongo/configdb
$ sudo mkdir -p /opt/trains/data/redis
$ sudo mkdir -p /opt/trains/logs
$ sudo mkdir -p /opt/trains/config
$ sudo mkdir -p /opt/trains/data/fileserver
@@ -58,6 +62,43 @@ To install and configure **trains-server** on Mac OS X, follow the steps below.
Your server is now running on [http://localhost:8080](http://localhost:8080)
### Running trains-server on Windows 10 <a name="docker_compose_win10"></a>
You can run **trains-server** on Windows 10 using Docker Desktop for Windows (see the Docker [System Requirements](https://docs.docker.com/docker-for-windows/install/#system-requirements)).
To run **trains-server** on Windows 10, follow the steps below.
1. Install the Docker Desktop for Windows application by either:
* Following the [Install Docker Desktop on Windows](https://docs.docker.com/docker-for-windows/install/) instructions.
* Running the Docker installation [wizard](https://hub.docker.com/?overlay=onboarding).
1. Increase the memory allocation in Docker Desktop to `4GB`.
1. In your Windows notification area (system tray), right click the Docker icon.
1. Click *Settings*, *Advanced*, and then set the memory to at least `4096`.
1. Click *Apply*.
1. Create local directories for data and logs. Open PowerShell and execute the following commands:
mkdir c:\opt\trains\logs
mkdir c:\opt\trains\config
mkdir c:\opt\trains\data
mkdir c:\opt\trains\data\elastic
mkdir c:\opt\trains\data\redis
mkdir c:\opt\trains\data\fileserver
1. Save the **trains-server** docker-compose YAML file [docker-compose-win10.yml](https://raw.githubusercontent.com/allegroai/trains-server/master/docker-compose-win10.yml) as `c:\opt\trains\docker-compose.yml`.
1. Run `docker-compose`. In PowerShell, execute the following commands:
cd c:\opt\trains\
docker-compose up
Your server is now running on [http://localhost:8080](http://localhost:8080)
### Installing trains-server on stand alone Linux Ubuntu systems <a name="ubuntu"></a>
To install **trains-server** on a stand alone Linux Ubuntu, follow the steps belows.

View File

@@ -49,40 +49,58 @@ The following sections provide a list containing AMI Image ID per region for eac
### Latest Version AMI <a name="autoupdate"></a>
**For easier upgrades: The following AMI automatically update to the latest release every reboot**
* **eu-north-1** : ami-072aef14041e70651
* **ap-south-1** : ami-08032d881daca4de1
* **eu-west-3** : ami-0b39c123d4343d408
* **eu-west-2** : ami-0e0fe6fd14b2e9029
* **eu-west-1** : ami-087c81e06d722e938
* **ap-northeast-2** : ami-0caf74f03322b994c
* **ap-northeast-1** : ami-0f723b3d49c0f2749
* **sa-east-1** : ami-0ac5595ad0e106502
* **ca-central-1** : ami-053049b463869469a
* **ap-southeast-1** : ami-0b440ec389d6ff541
* **ap-southeast-2** : ami-02af978ddc2c15b71
* **eu-central-1** : ami-09ef364aa8df29760
* **us-east-2** : ami-02e33f8ab77071509
* **us-west-1** : ami-0ff33f256907fd460
* **us-west-2** : ami-0387728fb09c8cda7
* **us-east-1** : ami-02c47c5233eed7f88
* **eu-north-1** : ami-055909c1b9471451d
* **ap-south-1** : ami-0476123cc77226faf
* **eu-west-3** : ami-01df7d35ab63cca70
* **eu-west-2** : ami-00e8004c11fd0228e
* **eu-west-1** : ami-04293fbba6d3acad1
* **ap-northeast-2** : ami-004331f9c5eb13e94
* **ap-northeast-1** : ami-08cc80e2049b30e61
* **sa-east-1** : ami-06d814a0b6ffa3153
* **ca-central-1** : ami-069210ff757e9c1b7
* **ap-southeast-1** : ami-0d12cc70d6e9c0f39
* **ap-southeast-2** : ami-0b4615aa76c055267
* **eu-central-1** : ami-06537f431e52e4763
* **us-east-2** : ami-0c3cfbcb8e72ecfc5
* **us-west-1** : ami-0d83de031b83b6880
* **us-west-2** : ami-06968633c4f7187c4
* **us-east-1** : ami-07ff2f5f7ef99e8f6
### v0.12.1
* **eu-north-1** : ami-003118a8103286d84
* **ap-south-1** : ami-02dfe86baa48e096f
* **eu-west-3** : ami-0cc1f01267d2a780d
* **eu-west-2** : ami-0e4c8332e5ce09585
* **eu-west-1** : ami-03459a2f0b0a3b1ab
* **ap-northeast-2** : ami-08f6c2aed3a53f24c
* **ap-northeast-1** : ami-0b798eab95a7c5435
* **sa-east-1** : ami-0d3ee166c09f0d1b2
* **ca-central-1** : ami-00a758c56bd63acd5
* **ap-southeast-1** : ami-0be64d4988cd03fbb
* **ap-southeast-2** : ami-02087310d43a63f31
* **eu-central-1** : ami-097bbefeac0c74225
* **us-east-2** : ami-07eda256712b90f4d
* **us-west-1** : ami-02ef2b55cbd01c7df
* **us-west-2** : ami-037c6176ef4735360
* **us-east-1** : ami-08715c20c0e3f1c15
### v0.12.0
* **eu-north-1** : ami-0ebb4bb8637d0da65
* **ap-south-1** : ami-0fb3c89eb8a8fc294
* **eu-west-3** : ami-0b55ea4a6698d5875
* **eu-west-2** : ami-02979b6d77856b842
* **eu-west-1** : ami-07f4c17a636489574
* **ap-northeast-2** : ami-06071092427dd5ab4
* **ap-northeast-1** : ami-0fbacddfc0e8d2651
* **sa-east-1** : ami-073590d3b3e6f4cfd
* **ca-central-1** : ami-0839610fc0101e41c
* **ap-southeast-1** : ami-0ff0adeef7f9fa879
* **ap-southeast-2** : ami-03ed15d31bfc2844c
* **eu-central-1** : ami-0813c06d8b2462c39
* **us-east-2** : ami-07c593425f988b054
* **us-west-1** : ami-0eb0e13b1f06c03c0
* **us-west-2** : ami-000568ca142798412
* **us-east-1** : ami-062d9da44f96c8a87
* **eu-north-1** : ami-03ff8ab48cd43e77e
* **ap-south-1** : ami-079c1a41ff836487c
* **eu-west-3** : ami-0121ef0398ae87ab0
* **eu-west-2** : ami-09f0f97654d8c79de
* **eu-west-1** : ami-0b7ba303f757bfcd9
* **ap-northeast-2** : ami-053f416517b5f40a6
* **ap-northeast-1** : ami-056dff06c698c2d9d
* **sa-east-1** : ami-017ab655119258639
* **ca-central-1** : ami-03bf5fa1d86ac97f6
* **ap-southeast-1** : ami-0e667958002b0360c
* **ap-southeast-2** : ami-091f1b69cb43b1933
* **eu-central-1** : ami-068ec2f0e98c26541
* **us-east-2** : ami-0524bbdc1b64ff83f
* **us-west-1** : ami-0b4facd7534e393c9
* **us-west-2** : ami-0018d5a7e58966848
* **us-east-1** : ami-08f24178fc14a84d2
### v0.11.0
* **eu-north-1** : ami-0cbe338f058018c97

9
docs/services.conf Normal file
View File

@@ -0,0 +1,9 @@
tasks {
non_responsive_tasks_watchdog {
# In-progress tasks that haven't been updated for at least 'value' seconds will be stopped by the watchdog
threshold_sec: 7200
# Watchdog will sleep for this number of seconds after each cycle
watch_interval_sec: 900
}
}

View File

@@ -0,0 +1,14 @@
from jsonmodels.fields import BoolField, DateTimeField, StringField
from jsonmodels.models import Base
class ReportStatsOptionRequest(Base):
enabled = BoolField(default=None, nullable=True)
class ReportStatsOptionResponse(Base):
supported = BoolField(default=True)
enabled = BoolField()
enabled_time = DateTimeField(nullable=True)
enabled_version = StringField(nullable=True)
enabled_user = StringField(nullable=True)

View File

@@ -161,7 +161,7 @@ class QueueMetrics:
In case no queue ids are specified the avg across all the
company queues is calculated for each metric
"""
# self._log_current_metrics(company_id, queue_ids=queue_ids)
# self._log_current_metrics(company, queue_ids=queue_ids)
if from_date >= to_date:
raise bad_request.FieldsValueError("from_date must be less than to_date")

View File

@@ -0,0 +1,87 @@
from datetime import datetime
import operator
from threading import Thread, Lock
from time import sleep
import attr
import psutil
class ResourceMonitor(Thread):
@attr.s(auto_attribs=True)
class Sample:
cpu_usage: float = 0.0
mem_used_gb: float = 0
mem_free_gb: float = 0
@classmethod
def _apply(cls, op, *samples):
return cls(
**{
field: op(*(getattr(sample, field) for sample in samples))
for field in attr.fields_dict(cls)
}
)
def min(self, sample):
return self._apply(min, self, sample)
def max(self, sample):
return self._apply(max, self, sample)
def avg(self, sample, count):
res = self._apply(lambda x: x * count, self)
res = self._apply(operator.add, res, sample)
res = self._apply(lambda x: x / (count + 1), res)
return res
def __init__(self, sample_interval_sec=5):
super(ResourceMonitor, self).__init__(daemon=True)
self.sample_interval_sec = sample_interval_sec
self._lock = Lock()
self._clear()
def _clear(self):
sample = self._get_sample()
self._avg = sample
self._min = sample
self._max = sample
self._clear_time = datetime.utcnow()
self._count = 1
@classmethod
def _get_sample(cls) -> Sample:
return cls.Sample(
cpu_usage=psutil.cpu_percent(),
mem_used_gb=psutil.virtual_memory().used / (1024 ** 3),
mem_free_gb=psutil.virtual_memory().free / (1024 ** 3),
)
def run(self):
while True:
sample = self._get_sample()
with self._lock:
self._min = self._min.min(sample)
self._max = self._max.max(sample)
self._avg = self._avg.avg(sample, self._count)
self._count += 1
sleep(self.sample_interval_sec)
def get_stats(self) -> dict:
""" Returns current resource statistics and clears internal resource statistics """
with self._lock:
min_ = attr.asdict(self._min)
max_ = attr.asdict(self._max)
avg = attr.asdict(self._avg)
res = {
"interval_sec": (datetime.utcnow() - self._clear_time).total_seconds(),
"num_cores": psutil.cpu_count(),
**{
k: {"min": v, "max": max_[k], "avg": avg[k]}
for k, v in min_.items()
}
}
self._clear()
return res

View File

@@ -0,0 +1,306 @@
import logging
import queue
import random
import time
from datetime import timedelta, datetime
from time import sleep
from typing import Sequence, Optional
import dpath
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from bll.query import Builder as QueryBuilder
from bll.util import get_server_uuid
from bll.workers import WorkerStats, WorkerBLL
from config import config
from config.info import get_deployment_type
from database.model import Company, User
from database.model.queue import Queue
from database.model.task.task import Task
from utilities import safe_get
from utilities.json import dumps
from utilities.threads_manager import ThreadsManager
from version import __version__ as current_version
from .resource_monitor import ResourceMonitor
log = config.logger(__file__)
worker_bll = WorkerBLL()
class StatisticsReporter:
threads = ThreadsManager("Statistics", resource_monitor=ResourceMonitor)
send_queue = queue.Queue()
supported = config.get("apiserver.statistics.supported", True)
@classmethod
def start(cls):
cls.start_sender()
cls.start_reporter()
@classmethod
@threads.register("reporter", daemon=True)
def start_reporter(cls):
"""
Periodically send statistics reports for companies who have opted in.
Note: in trains we usually have only a single company
"""
if not cls.supported:
return
report_interval = timedelta(
hours=config.get("apiserver.statistics.report_interval_hours", 24)
)
while True:
sleep(report_interval.total_seconds())
try:
for company in Company.objects(
defaults__stats_option__enabled=True
).only("id"):
stats = cls.get_statistics(company.id)
cls.send_queue.put(stats)
except Exception as ex:
log.exception(f"Failed collecting stats: {str(ex)}")
@classmethod
@threads.register("sender", daemon=True)
def start_sender(cls):
if not cls.supported:
return
url = config.get("apiserver.statistics.url")
retries = config.get("apiserver.statistics.max_retries", 5)
max_backoff = config.get("apiserver.statistics.max_backoff_sec", 5)
session = requests.Session()
adapter = HTTPAdapter(max_retries=Retry(retries))
session.mount("http://", adapter)
session.mount("https://", adapter)
session.headers["Content-type"] = "application/json"
WarningFilter.attach()
while True:
try:
report = cls.send_queue.get()
# Set a random backoff factor each time we send a report
adapter.max_retries.backoff_factor = random.random() * max_backoff
session.post(url, data=dumps(report))
except Exception as ex:
pass
@classmethod
def get_statistics(cls, company_id: str) -> dict:
"""
Returns a statistics report per company
"""
return {
"time": datetime.utcnow(),
"company_id": company_id,
"server": {
"version": current_version,
"deployment": get_deployment_type(),
"uuid": get_server_uuid(),
"queues": {"count": Queue.objects(company=company_id).count()},
"users": {"count": User.objects(company=company_id).count()},
"resources": cls.threads.resource_monitor.get_stats(),
"experiments": next(
iter(cls._get_experiments_stats(company_id).values()), {}
),
},
"agents": cls._get_agents_statistics(company_id),
}
@classmethod
def _get_agents_statistics(cls, company_id: str) -> Sequence[dict]:
result = cls._get_resource_stats_per_agent(company_id, key="resources")
dpath.merge(
result, cls._get_experiments_stats_per_agent(company_id, key="experiments")
)
return [{"uuid": agent_id, **data} for agent_id, data in result.items()]
@classmethod
def _get_resource_stats_per_agent(cls, company_id: str, key: str) -> dict:
agent_resource_threshold_sec = timedelta(
hours=config.get("apiserver.statistics.report_interval_hours", 24)
).total_seconds()
to_timestamp = int(time.time())
from_timestamp = to_timestamp - int(agent_resource_threshold_sec)
es_req = {
"size": 0,
"query": QueryBuilder.dates_range(from_timestamp, to_timestamp),
"aggs": {
"workers": {
"terms": {"field": "worker"},
"aggs": {
"categories": {
"terms": {"field": "category"},
"aggs": {"count": {"cardinality": {"field": "variant"}}},
},
"metrics": {
"terms": {"field": "metric"},
"aggs": {
"min": {"min": {"field": "value"}},
"max": {"max": {"field": "value"}},
"avg": {"avg": {"field": "value"}},
},
},
},
}
},
}
res = cls._run_worker_stats_query(company_id, es_req)
def _get_cardinality_fields(categories: Sequence[dict]) -> dict:
names = {"cpu": "num_cores"}
return {
names[c["key"]]: safe_get(c, "count/value")
for c in categories
if c["key"] in names
}
def _get_metric_fields(metrics: Sequence[dict]) -> dict:
names = {
"cpu_usage": "cpu_usage",
"memory_used": "mem_used_gb",
"memory_free": "mem_free_gb",
}
return {
names[m["key"]]: {
"min": safe_get(m, "min/value"),
"max": safe_get(m, "max/value"),
"avg": safe_get(m, "avg/value"),
}
for m in metrics
if m["key"] in names
}
buckets = safe_get(res, "aggregations/workers/buckets", default=[])
return {
b["key"]: {
key: {
"interval_sec": agent_resource_threshold_sec,
**_get_cardinality_fields(safe_get(b, "categories/buckets", [])),
**_get_metric_fields(safe_get(b, "metrics/buckets", [])),
}
}
for b in buckets
}
@classmethod
def _get_experiments_stats_per_agent(cls, company_id: str, key: str) -> dict:
agent_relevant_threshold = timedelta(
days=config.get("apiserver.statistics.agent_relevant_threshold_days", 30)
)
to_timestamp = int(time.time())
from_timestamp = to_timestamp - int(agent_relevant_threshold.total_seconds())
workers = cls._get_active_workers(company_id, from_timestamp, to_timestamp)
if not workers:
return {}
stats = cls._get_experiments_stats(company_id, list(workers.keys()))
return {
worker_id: {key: {**workers[worker_id], **stat}}
for worker_id, stat in stats.items()
}
@classmethod
def _get_active_workers(
cls, company_id, from_timestamp: int, to_timestamp: int
) -> dict:
es_req = {
"size": 0,
"query": QueryBuilder.dates_range(from_timestamp, to_timestamp),
"aggs": {
"workers": {
"terms": {"field": "worker"},
"aggs": {"last_activity_time": {"max": {"field": "timestamp"}}},
}
},
}
res = cls._run_worker_stats_query(company_id, es_req)
buckets = safe_get(res, "aggregations/workers/buckets", default=[])
return {
b["key"]: {"last_activity_time": b["last_activity_time"]["value"]}
for b in buckets
}
@classmethod
def _run_worker_stats_query(cls, company_id, es_req) -> dict:
return worker_bll.es_client.search(
index=f"{WorkerStats.worker_stats_prefix_for_company(company_id)}*",
doc_type="stat",
body=es_req,
)
@classmethod
def _get_experiments_stats(
cls, company_id, workers: Optional[Sequence] = None
) -> dict:
pipeline = [
{
"$match": {
"company": company_id,
"started": {"$exists": True, "$ne": None},
"last_update": {"$exists": True, "$ne": None},
"status": {"$nin": ["created", "queued"]},
**({"last_worker": {"$in": workers}} if workers else {}),
}
},
{
"$group": {
"_id": "$last_worker" if workers else None,
"count": {"$sum": 1},
"avg_run_time_sec": {
"$avg": {
"$divide": [
{"$subtract": ["$last_update", "$started"]},
1000,
]
}
},
"avg_iterations": {"$avg": "$last_iteration"},
}
},
{
"$project": {
"count": 1,
"avg_run_time_sec": {"$trunc": "$avg_run_time_sec"},
"avg_iterations": {"$trunc": "$avg_iterations"},
}
},
]
return {
group["_id"]: {k: v for k, v in group.items() if k != "_id"}
for group in Task.aggregate(*pipeline)
}
class WarningFilter(logging.Filter):
@classmethod
def attach(cls):
from urllib3.connectionpool import (
ConnectionPool,
) # required to make sure the logger is created
assert ConnectionPool # make sure import is not optimized out
logging.getLogger("urllib3.connectionpool").addFilter(cls())
def filter(self, record):
if (
record.levelno == logging.WARNING
and len(record.args) > 2
and record.args[2] == "/stats"
):
return False
return True

View File

@@ -29,7 +29,7 @@ from .utils import ChangeStatusRequest, validate_status_change
class TaskBLL(object):
threads = ThreadsManager()
threads = ThreadsManager("TaskBLL")
def __init__(self, events_es=None):
self.events_es = (

View File

@@ -1,7 +1,9 @@
import functools
from operator import itemgetter
from typing import Sequence, Optional, Callable, Tuple, Dict, Any, Set
from database.model import AttributedDocument
from database.model.settings import Settings
def extract_properties_to_lists(
@@ -64,3 +66,8 @@ class SetFieldsResolver:
in the format suitable for projection (dot separated)
"""
return set(name.replace("__", ".") for name in self.fields.values())
@functools.lru_cache()
def get_server_uuid() -> Optional[str]:
return Settings.get_by_key("server.uuid")

View File

@@ -4,6 +4,7 @@ from typing import Sequence, Set, Optional
import attr
import elasticsearch.helpers
import es_factory
from apierrors import APIError
from apierrors.errors import bad_request, server_error
@@ -22,10 +23,9 @@ from database.model.auth import User
from database.model.company import Company
from database.model.queue import Queue
from database.model.task.task import Task
from service_repo.redis_manager import redman
from redis_manager import redman
from timing_context import TimingContext
from tools import safe_get
from .stats import WorkerStats
log = config.logger(__file__)
@@ -33,9 +33,9 @@ log = config.logger(__file__)
class WorkerBLL:
def __init__(self, es=None, redis=None):
self.es = es if es is not None else es_factory.connect("workers")
self.es_client = es if es is not None else es_factory.connect("workers")
self.redis = redis if redis is not None else redman.connection("workers")
self._stats = WorkerStats(self.es)
self._stats = WorkerStats(self.es_client)
@property
def stats(self) -> WorkerStats:
@@ -396,7 +396,7 @@ class WorkerBLL:
for i, val in enumerate(value)
)
es_res = elasticsearch.helpers.bulk(self.es, actions)
es_res = elasticsearch.helpers.bulk(self.es_client, actions)
added, errors = es_res[:2]
return (added == len(actions)) and not errors

View File

@@ -101,4 +101,18 @@
# GET request timeout
request_timeout_sec: 3.0
}
statistics {
# Note: statistics are sent ONLY if the user has actively opted-in
supported: true
url: "https://updates.trains.allegro.ai/stats"
report_interval_hours: 24
agent_relevant_threshold_days: 30
max_retries: 5
max_backoff_sec: 5
}
}

View File

@@ -1,5 +1,6 @@
from functools import lru_cache
from pathlib import Path
from os import getenv
root = Path(__file__).parent.parent
@@ -26,3 +27,17 @@ def get_commit_number():
return (root / "COMMIT").read_text().strip()
except FileNotFoundError:
return ""
@lru_cache()
def get_deployment_type() -> str:
value = getenv("TRAINS_SERVER_DEPLOYMENT_TYPE")
if value:
return value
try:
value = (root / "DEPLOY").read_text().strip()
except FileNotFoundError:
pass
return value or "manual"

View File

@@ -1,5 +1,6 @@
from os import getenv
from boltons.iterutils import first
from furl import furl
from jsonmodels import models
from jsonmodels.errors import ValidationError
@@ -11,14 +12,16 @@ from config import config
from .defs import Database
from .utils import get_items
from boltons.iterutils import first
log = config.logger("database")
strict = config.get("apiserver.mongo.strict", True)
OVERRIDE_HOST_ENV_KEY = ("MONGODB_SERVICE_HOST", "MONGODB_SERVICE_SERVICE_HOST")
OVERRIDE_PORT_ENV_KEY = "MONGODB_SERVICE_PORT"
OVERRIDE_HOST_ENV_KEY = (
"TRAINS_MONGODB_SERVICE_HOST",
"MONGODB_SERVICE_HOST",
"MONGODB_SERVICE_SERVICE_HOST",
)
OVERRIDE_PORT_ENV_KEY = ("TRAINS_MONGODB_SERVICE_PORT", "MONGODB_SERVICE_PORT")
_entries = []
@@ -41,7 +44,7 @@ def initialize():
if override_hostname:
log.info(f"Using override mongodb host {override_hostname}")
override_port = getenv(OVERRIDE_PORT_ENV_KEY)
override_port = first(map(getenv, OVERRIDE_PORT_ENV_KEY), None)
if override_port:
log.info(f"Using override mongodb port {override_port}")

View File

@@ -1,23 +1,36 @@
from mongoengine import Document, EmbeddedDocument, EmbeddedDocumentField, StringField, Q
from mongoengine import (
Document,
EmbeddedDocument,
EmbeddedDocumentField,
StringField,
Q,
BooleanField,
DateTimeField,
)
from database import Database, strict
from database.fields import StrippedStringField
from database.model import DbModelMixin
class ReportStatsOption(EmbeddedDocument):
enabled = BooleanField(default=False) # opt-in for statistics reporting
enabled_version = StringField() # server version when enabled
enabled_time = DateTimeField() # time when enabled
enabled_user = StringField() # ID of user who enabled
class CompanyDefaults(EmbeddedDocument):
cluster = StringField()
stats_option = EmbeddedDocumentField(ReportStatsOption, default=ReportStatsOption)
class Company(DbModelMixin, Document):
meta = {
'db_alias': Database.backend,
'strict': strict,
}
meta = {"db_alias": Database.backend, "strict": strict}
id = StringField(primary_key=True)
name = StrippedStringField(unique=True, min_length=3)
defaults = EmbeddedDocumentField(CompanyDefaults)
defaults = EmbeddedDocumentField(CompanyDefaults, default=CompanyDefaults)
@classmethod
def _prepare_perm_query(cls, company, allow_public=False):

View File

@@ -0,0 +1,57 @@
from typing import Any, Optional, Sequence, Tuple
from mongoengine import Document, StringField, DynamicField, Q
from mongoengine.errors import NotUniqueError
from database import Database, strict
from database.model import DbModelMixin
class Settings(DbModelMixin, Document):
meta = {
"db_alias": Database.backend,
"strict": strict,
}
key = StringField(primary_key=True)
value = DynamicField()
@classmethod
def get_by_key(cls, key: str, default: Optional[Any] = None, sep: str = ".") -> Any:
key = key.strip(sep)
res = Settings.objects(key=key).first()
if not res:
return default
return res.value
@classmethod
def get_by_prefix(
cls, key_prefix: str, default: Optional[Any] = None, sep: str = "."
) -> Sequence[Tuple[str, Any]]:
key_prefix = key_prefix.strip(sep)
query = Q(key=key_prefix) | Q(key__startswith=key_prefix + sep)
res = Settings.objects(query)
if not res:
return default
return [(x.key, x.value) for x in res]
@classmethod
def set_or_add_value(cls, key: str, value: Any, sep: str = ".") -> bool:
""" Sets a new value or adds a new key/value setting (if key does not exist) """
key = key.strip(sep)
res = Settings.objects(key=key).update(key=key, value=value, upsert=True)
# if Settings.objects(key=key).only("key"):
#
# else:
# res = Settings(key=key, value=value).save()
return bool(res)
@classmethod
def add_value(cls, key: str, value: Any, sep: str = ".") -> bool:
""" Adds a new key/value settings. Fails if key already exists. """
key = key.strip(sep)
try:
res = Settings(key=key, value=value).save(force_insert=True)
return bool(res)
except NotUniqueError:
return False

View File

@@ -1,20 +1,25 @@
from datetime import datetime
from os import getenv
from boltons.iterutils import first
from elasticsearch import Elasticsearch, Transport
from config import config
log = config.logger(__file__)
OVERRIDE_HOST_ENV_KEY = ("ELASTIC_SERVICE_HOST", "ELASTIC_SERVICE_SERVICE_HOST")
OVERRIDE_PORT_ENV_KEY = "ELASTIC_SERVICE_PORT"
OVERRIDE_HOST_ENV_KEY = (
"TRAINS_ELASTIC_SERVICE_HOST",
"ELASTIC_SERVICE_HOST",
"ELASTIC_SERVICE_SERVICE_HOST",
)
OVERRIDE_PORT_ENV_KEY = ("TRAINS_ELASTIC_SERVICE_PORT", "ELASTIC_SERVICE_PORT")
OVERRIDE_HOST = next(filter(None, map(getenv, OVERRIDE_HOST_ENV_KEY)), None)
OVERRIDE_HOST = first(filter(None, map(getenv, OVERRIDE_HOST_ENV_KEY)))
if OVERRIDE_HOST:
log.info(f"Using override elastic host {OVERRIDE_HOST}")
OVERRIDE_PORT = getenv(OVERRIDE_PORT_ENV_KEY)
OVERRIDE_PORT = first(filter(None, map(getenv, OVERRIDE_PORT_ENV_KEY)))
if OVERRIDE_PORT:
log.info(f"Using override elastic port {OVERRIDE_PORT}")
@@ -25,6 +30,7 @@ class MissingClusterConfiguration(Exception):
"""
Exception when cluster configuration is not found in config files
"""
pass
@@ -32,6 +38,7 @@ class InvalidClusterConfiguration(Exception):
"""
Exception when cluster configuration does not contain required properties
"""
pass
@@ -46,12 +53,14 @@ def connect(cluster_name):
"""
if cluster_name not in _instances:
cluster_config = get_cluster_config(cluster_name)
hosts = cluster_config.get('hosts', None)
hosts = cluster_config.get("hosts", None)
if not hosts:
raise InvalidClusterConfiguration(cluster_name)
args = cluster_config.get('args', {})
_instances[cluster_name] = Elasticsearch(hosts=hosts, transport_class=Transport, **args)
args = cluster_config.get("args", {})
_instances[cluster_name] = Elasticsearch(
hosts=hosts, transport_class=Transport, **args
)
return _instances[cluster_name]
@@ -63,13 +72,13 @@ def get_cluster_config(cluster_name):
:return: config section for the cluster
:raises MissingClusterConfiguration: in case no config section is found for the cluster
"""
cluster_key = '.'.join(('hosts.elastic', cluster_name))
cluster_key = ".".join(("hosts.elastic", cluster_name))
cluster_config = config.get(cluster_key, None)
if not cluster_config:
raise MissingClusterConfiguration(cluster_name)
def set_host_prop(key, value):
for host in cluster_config.get('hosts', []):
for host in cluster_config.get("hosts", []):
host[key] = value
if OVERRIDE_HOST:

View File

@@ -1,6 +1,7 @@
import importlib.util
from datetime import datetime
from pathlib import Path
from uuid import uuid4
import attr
from furl import furl
@@ -15,6 +16,7 @@ from database.model.auth import Role
from database.model.auth import User as AuthUser, Credentials
from database.model.company import Company
from database.model.queue import Queue
from database.model.settings import Settings
from database.model.user import User
from database.model.version import Version as DatabaseVersion
from elastic.apply_mappings import apply_mappings_to_host
@@ -109,10 +111,7 @@ def _ensure_user(user: FixedUser, company_id: str):
data["email"] = f"{user.user_id}@example.com"
data["role"] = Role.user
_ensure_auth_user(
user_data=data,
company_id=company_id,
)
_ensure_auth_user(user_data=data, company_id=company_id)
given_name, _, family_name = user.name.partition(" ")
@@ -142,9 +141,7 @@ def _apply_migrations():
try:
new_scripts = {
ver: path
for ver, path in (
(Version(f.stem), f) for f in migration_dir.glob("*.py")
)
for ver, path in ((Version(f.stem), f) for f in migration_dir.glob("*.py"))
if ver > last_version
}
except ValueError as ex:
@@ -179,16 +176,30 @@ def _apply_migrations():
).save()
def _ensure_uuid():
Settings.add_value("server.uuid", str(uuid4()))
def init_mongo_data():
try:
_apply_migrations()
_ensure_uuid()
company_id = _ensure_company()
_ensure_default_queue(company_id)
users = [
{"name": "apiserver", "role": Role.system, "email": "apiserver@example.com"},
{"name": "webserver", "role": Role.system, "email": "webserver@example.com"},
{
"name": "apiserver",
"role": Role.system,
"email": "apiserver@example.com",
},
{
"name": "webserver",
"role": Role.system,
"email": "webserver@example.com",
},
{"name": "tests", "role": Role.user, "email": "tests@example.com"},
]

View File

@@ -2,21 +2,23 @@ import threading
from os import getenv
from time import sleep
from apierrors.errors.server_error import ConfigError, GeneralError
from config import config
from boltons.iterutils import first
from redis import StrictRedis
from redis.sentinel import Sentinel, SentinelConnectionPool
from apierrors.errors.server_error import ConfigError, GeneralError
from config import config
log = config.logger(__file__)
OVERRIDE_HOST_ENV_KEY = "REDIS_SERVICE_HOST"
OVERRIDE_PORT_ENV_KEY = "REDIS_SERVICE_PORT"
OVERRIDE_HOST_ENV_KEY = ("TRAINS_REDIS_SERVICE_HOST", "REDIS_SERVICE_HOST")
OVERRIDE_PORT_ENV_KEY = ("TRAINS_REDIS_SERVICE_PORT", "REDIS_SERVICE_PORT")
OVERRIDE_HOST = getenv(OVERRIDE_HOST_ENV_KEY)
OVERRIDE_HOST = first(filter(None, map(getenv, OVERRIDE_HOST_ENV_KEY)))
if OVERRIDE_HOST:
log.info(f"Using override redis host {OVERRIDE_HOST}")
OVERRIDE_PORT = getenv(OVERRIDE_PORT_ENV_KEY)
OVERRIDE_PORT = first(filter(None, map(getenv, OVERRIDE_PORT_ENV_KEY)))
if OVERRIDE_PORT:
log.info(f"Using override redis port {OVERRIDE_PORT}")

View File

@@ -28,3 +28,4 @@ semantic_version>=2.6.0,<3
furl>=2.0.0
redis>=2.10.5
humanfriendly==4.18
psutil>=5.6.5

View File

@@ -3,6 +3,25 @@ _default {
internal: true
allow_roles: ["root", "system"]
}
get_stats {
"2.1" {
description: "Get the server collected statistics."
request {
type: object
properties {
interval {
description: "The period for statistics collection in seconds."
type: long
}
}
}
response {
type: object
properties: {
}
}
}
}
config {
"2.1" {
description: "Get server configuration. Secure section is not returned."
@@ -66,3 +85,39 @@ endpoints {
}
}
}
report_stats_option {
"2.4" {
description: "Get or set the report statistics option per-company"
request {
type: object
properties {
enabled {
description: "If provided, sets the report statistics option (true/false)"
type: boolean
}
}
}
response {
type: object
properties {
enabled {
description: "Returns the current report stats option value"
type: boolean
}
enabled_time {
description: "If enabled, returns the time at which option was enabled"
type: string
format: date-time
}
enabled_version {
description: "If enabled, returns the server version at the time option was enabled"
type: string
}
enabled_user {
description: "If enabled, returns Id of the user who enabled the option"
type: string
}
}
}
}
}

View File

@@ -7,15 +7,15 @@ from werkzeug.exceptions import BadRequest
import database
from apierrors.base import BaseError
from bll.statistics.stats_reporter import StatisticsReporter
from config import config
from init_data import init_es_data, init_mongo_data
from service_repo import ServiceRepo, APICall
from service_repo.auth import AuthType
from service_repo.errors import PathParsingError
from timing_context import TimingContext
from utilities import json
from init_data import init_es_data, init_mongo_data
from updates import check_updates_thread
from utilities import json
app = Flask(__name__, static_url_path="/static")
CORS(app, **config.get("apiserver.cors"))
@@ -38,6 +38,7 @@ log.info(f"Exposed Services: {' '.join(ServiceRepo.endpoint_names())}")
check_updates_thread.start()
StatisticsReporter.start()
@app.before_first_request
@@ -57,7 +58,9 @@ def before_request():
content, content_type = ServiceRepo.handle_call(call)
headers = {}
if call.result.filename:
headers["Content-Disposition"] = f"attachment; filename={call.result.filename}"
headers[
"Content-Disposition"
] = f"attachment; filename={call.result.filename}"
if call.result.headers:
headers.update(call.result.headers)
@@ -71,7 +74,9 @@ def before_request():
if value is None:
response.set_cookie(key, "", expires=0)
else:
response.set_cookie(key, value, **config.get("apiserver.auth.cookies"))
response.set_cookie(
key, value, **config.get("apiserver.auth.cookies")
)
return response
except Exception as ex:

View File

@@ -255,6 +255,7 @@ class MissingIdentity(Exception):
class APICall(DataContainer):
HEADER_AUTHORIZATION = "Authorization"
HEADER_REAL_IP = "X-Real-IP"
HEADER_FORWARDED_FOR = "X-Forwarded-For"
""" Standard headers """
_transaction_headers = ("X-Trains-Trx",)
@@ -382,8 +383,13 @@ class APICall(DataContainer):
@property
def real_ip(self):
real_ip = self.get_header(self.HEADER_REAL_IP)
return real_ip or self._remote_addr or "untrackable"
""" Obtain visitor's IP address """
return (
self.get_header(self.HEADER_FORWARDED_FOR)
or self.get_header(self.HEADER_REAL_IP)
or self._remote_addr
or "untrackable"
)
@property
def failed(self):

View File

@@ -394,7 +394,7 @@ def get_task_plots_v1_7(call, company_id, req_model):
task_bll.assert_exists(call.identity.company, task_id, allow_public=True)
# events, next_scroll_id, total_events = event_bll.get_task_events(
# company_id, task_id,
# company, task_id,
# event_type="plot",
# sort=[{"iter": {"order": "desc"}}],
# last_iter_count=iters,
@@ -453,7 +453,7 @@ def get_debug_images_v1_7(call, company_id, req_model):
task_bll.assert_exists(call.identity.company, task_id, allow_public=True)
# events, next_scroll_id, total_events = event_bll.get_task_events(
# company_id, task_id,
# company, task_id,
# event_type="training_debug_image",
# sort=[{"iter": {"order": "desc"}}],
# last_iter_count=iters,

View File

@@ -1,8 +1,24 @@
from datetime import datetime
from pyhocon.config_tree import NoneValue
from apierrors import errors
from apimodels.server import ReportStatsOptionRequest, ReportStatsOptionResponse
from bll.statistics.stats_reporter import StatisticsReporter
from config import config
from config.info import get_version, get_build_number, get_commit_number
from database.errors import translate_errors_context
from database.model import Company
from database.model.company import ReportStatsOption
from service_repo import ServiceRepo, APICall, endpoint
from version import __version__ as current_version
@endpoint("server.get_stats")
def get_stats(call: APICall):
call.result.data = StatisticsReporter.get_statistics(
company_id=call.identity.company
)
@endpoint("server.config")
@@ -43,3 +59,35 @@ def info(call: APICall):
"build": get_build_number(),
"commit": get_commit_number(),
}
@endpoint(
"server.report_stats_option",
request_data_model=ReportStatsOptionRequest,
response_data_model=ReportStatsOptionResponse,
)
def report_stats(call: APICall, company: str, request: ReportStatsOptionRequest):
if not StatisticsReporter.supported:
result = ReportStatsOptionResponse(supported=False)
else:
enabled = request.enabled
with translate_errors_context():
query = Company.objects(id=company)
if enabled is None:
stats_option = query.first().defaults.stats_option
else:
stats_option = ReportStatsOption(
enabled=enabled,
enabled_time=datetime.utcnow(),
enabled_version=current_version,
enabled_user=call.identity.user,
)
updated = query.update(defaults__stats_option=stats_option)
if not updated:
raise errors.server_error.InternalError(
f"Failed setting report_stats to {enabled}"
)
result = ReportStatsOptionResponse(**stats_option.to_mongo())
call.result.data_model = result

View File

@@ -108,7 +108,7 @@ class TestWorkersService(TestService):
from_date = to_date - timedelta(days=1)
# no variants
res = self.api.workers.get_stats(
res = self.api.workers.get_statistics(
items=[
dict(key="cpu_usage", aggregation="avg"),
dict(key="cpu_usage", aggregation="max"),
@@ -142,7 +142,7 @@ class TestWorkersService(TestService):
)
# split by variants
res = self.api.workers.get_stats(
res = self.api.workers.get_statistics(
items=[dict(key="cpu_usage", aggregation="avg")],
from_date=from_date.timestamp(),
to_date=to_date.timestamp(),
@@ -165,7 +165,7 @@ class TestWorkersService(TestService):
assert all(_check_metric_and_variants(worker) for worker in res["workers"])
res = self.api.workers.get_stats(
res = self.api.workers.get_statistics(
items=[dict(key="cpu_usage", aggregation="avg")],
from_date=from_date.timestamp(),
to_date=to_date.timestamp(),
@@ -183,12 +183,13 @@ class TestWorkersService(TestService):
# run on an empty es db since we have no way
# to pass non existing workers to this api
# res = self.api.workers.get_activity_report(
# from_date=from_date.timestamp(),
# to_date=to_date.timestamp(),
# from_timestamp=from_timestamp.timestamp(),
# to_timestamp=to_timestamp.timestamp(),
# interval=20,
# )
self._simulate_workers()
to_date = utc_now_tz_aware()
from_date = to_date - timedelta(minutes=10)

View File

@@ -8,6 +8,7 @@ import requests
from semantic_version import Version
from config import config
from database.model.settings import Settings
from version import __version__ as current_version
log = config.logger(__name__)
@@ -39,12 +40,15 @@ class CheckUpdatesThread(Thread):
def _check_new_version_available(self) -> Optional[_VersionResponse]:
url = config.get(
"apiserver.check_for_updates.url", "https://updates.trains.allegro.ai/updates"
"apiserver.check_for_updates.url",
"https://updates.trains.allegro.ai/updates",
)
uid = Settings.get_by_key("server.uuid")
response = requests.get(
url,
json={"versions": {self.component_name: str(current_version)}},
json={"versions": {self.component_name: str(current_version)}, "uid": uid},
timeout=float(
config.get("apiserver.check_for_updates.request_timeout_sec", 3.0)
),

View File

@@ -1,14 +1,29 @@
from functools import wraps
from threading import Lock, Thread
import attr
@attr.s(auto_attribs=True)
class ThreadsManager:
objects = {}
lock = Lock()
def __init__(self, name=None, **threads):
super(ThreadsManager, self).__init__()
self.name = name or self.__class__.name
self.objects = {}
self.lock = Lock()
for name, thread in threads.items():
if issubclass(thread, Thread):
thread = thread()
thread.start()
elif isinstance(thread, Thread):
if not thread.is_alive():
thread.start()
else:
raise Exception(f"Expected thread or thread class ({name}): {thread}")
self.objects[name] = thread
def register(self, thread_name, daemon=True):
def decorator(f):
@wraps(f)
@@ -17,7 +32,7 @@ class ThreadsManager:
thread = self.objects.get(thread_name)
if not thread:
thread = Thread(
target=f, name=thread_name, args=args, kwargs=kwargs
target=f, name=f"{self.name}_{thread_name}", args=args, kwargs=kwargs
)
thread.daemon = daemon
thread.start()
@@ -27,3 +42,13 @@ class ThreadsManager:
return wrapper
return decorator
def __getattr__(self, item):
if item in self.objects:
return self.objects[item]
return self.__getattribute__(item)
def __getitem__(self, item):
if item in self.objects:
return self.objects[item]
raise KeyError(item)