---
title: On-Premises on Ubuntu
---
This guide provides step-by-step instructions for installing the ClearML Enterprise Server on a single Ubuntu Linux server.
## Prerequisites
The following are required for the ClearML on-premises server:
- At least 8 CPUs
- At least 32 GB RAM
- OS - Ubuntu 20 or higher
- 4 Disks:
  - Root
    - For storing the system and Docker
    - Recommended at least 30 GB
    - Mounted to `/`
  - Docker
    - For storing Docker data
    - Recommended at least 80 GB
    - Mounted to `/var/lib/docker` with permissions 710
  - Data
    - For storing the Elasticsearch and MongoDB databases
    - Size depends on usage. Recommended not to start with less than 100 GB
    - Mounted to `/opt/allegro/data`
  - File Server
    - For storing `fileserver` files (models and debug samples)
    - Size depends on usage
    - Mounted to `/opt/allegro/data/fileserver`
- User for running ClearML services with administrator privileges
- Ports 8080, 8081, and 8008 available for the ClearML Server services
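
Before installing, you can sanity-check the disk layout with a quick look at the mount points and available space (a minimal check, assuming the mounts listed above are already in place):
```
# Verify the expected mount points and available space
df -h / /var/lib/docker /opt/allegro/data /opt/allegro/data/fileserver
lsblk
```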
In addition, make sure you have the following (provided by ClearML):
- Docker Hub credentials to access the ClearML images
- `docker-compose.yml` - The main compose file containing the service definitions
- `docker-compose.override.yml` - The override file containing server-specific additions, such as SSO integration
- `constants.env` - An `env` file containing values used by the `docker-compose` files that are unique to a specific
environment, such as keys and secrets for system users, credentials, and image versions. This file should be reviewed
and modified prior to the server installation (an illustrative example is shown below).
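
For reference, `constants.env` holds simple `KEY=VALUE` pairs. The exact keys are defined by the file ClearML provides; the values below are only illustrative placeholders for variables referenced later in this guide:
```
# Illustrative placeholders only - use the constants.env provided by ClearML
WEBSERVER_URL_FOR_EXTERNAL_WORKERS=http://<server_ip_here>:8080
APISERVER_URL_FOR_EXTERNAL_WORKERS=http://<server_ip_here>:8008
FILESERVER_URL_FOR_EXTERNAL_WORKERS=http://<server_ip_here>:8081
APISERVER_WORKERS_NUMBER=8
```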
## Installing ClearML Server
### Preliminary Steps
1. Install Docker CE following the official instructions: [https://docs.docker.com/install/linux/docker-ce/ubuntu/](https://docs.docker.com/install/linux/docker-ce/ubuntu/)
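If you prefer a scripted setup over the manual steps in the linked instructions, Docker's convenience script can be used (a shortcut rather than the recommended production method; review the script before running it):
```
# Convenience script from Docker (review before running)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
```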
1. Verify the Docker CE installation:
```
docker run hello-world
```
Expected output:
```
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub. (amd64)
3. The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it to your terminal.
```
1. Install `docker-compose`:
```
sudo curl -L "https://github.com/docker/compose/releases/download/1.24.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```
:::note
You might need to downgrade urllib3 by running `sudo pip3 install urllib3==1.26.2`
:::
1. Increase `vm.max_map_count` for Elasticsearch in Docker:
```
echo "vm.max_map_count=262144" > /tmp/99-allegro.conf
echo "vm.overcommit_memory=1" >> /tmp/99-allegro.conf
echo "fs.inotify.max_user_instances=256" >> /tmp/99-allegro.conf
sudo mv /tmp/99-allegro.conf /etc/sysctl.d/99-allegro.conf
sudo sysctl -w vm.max_map_count=262144
sudo service docker restart
```
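To confirm the kernel parameters took effect (a quick check; the values should match those written above):
```
sysctl vm.max_map_count vm.overcommit_memory fs.inotify.max_user_instances
```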
1. Disable THP. Create the `/etc/systemd/system/disable-thp.service` service file with the following content:
:::important
The `ExecStart` string (under `[Service]`) should be a single line.
:::
```
[Unit]
Description=Disable Transparent Huge Pages (THP)
[Service]
Type=simple
ExecStart=/bin/sh -c "echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled && echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag"
[Install]
WantedBy=multi-user.target
```
1. Enable the service so it runs on boot:
```
sudo systemctl daemon-reload
sudo systemctl enable disable-thp
```
1. Restart the machine
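
After the machine restarts, you can verify that THP is disabled (a quick check; `never` should be the selected value):
```
cat /sys/kernel/mm/transparent_hugepage/enabled
# Expected output: always madvise [never]
```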
### Installing the Server
1. Remove any previous installation of ClearML Server
```
sudo rm -R /opt/clearml/
sudo rm -R /opt/allegro/
```
1. Create local directories for the databases and storage:
```
sudo mkdir -pv /opt/allegro/data/elastic7plus
sudo chown 1000:1000 /opt/allegro/data/elastic7plus
sudo mkdir -pv /opt/allegro/data/mongo_4/configdb
sudo mkdir -pv /opt/allegro/data/mongo_4/db
sudo mkdir -pv /opt/allegro/data/redis
sudo mkdir -pv /opt/allegro/logs/apiserver
sudo mkdir -pv /opt/allegro/documentation
sudo mkdir -pv /opt/allegro/data/fileserver
sudo mkdir -pv /opt/allegro/logs/fileserver
sudo mkdir -pv /opt/allegro/logs/fileserver-proxy
sudo mkdir -pv /opt/allegro/data/fluentd/buffer
sudo mkdir -pv /opt/allegro/config/webserver_external_files
sudo mkdir -pv /opt/allegro/config/onprem_poc
```
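The Elasticsearch data directory must be owned by UID/GID 1000, as set above. A quick way to confirm it was created correctly:
```
# Should show owner and group 1000
ls -ldn /opt/allegro/data/elastic7plus
```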
1. Copy the following ClearML configuration files to `/opt/allegro`
* `constants.env`
* `docker-compose.override.yml`
* `docker-compose.yml`
1. Create an initial ClearML configuration file `/opt/allegro/config/onprem_poc/apiserver.conf` with a fixed user:
```
auth {
    fixed_users {
        enabled: true,
        users: [
            {username: "support", password: "<enter password here>", admin: true, name: "allegro.ai Support User"},
        ]
    }
}
```
1. Log into the Docker Hub repository using the username and password provided by ClearML:
```
sudo docker login -u=$DOCKERHUB_USER -p=$DOCKERHUB_PASSWORD
```
1. Start the services: change to the directory containing the `docker-compose` files and run the following command:
```
sudo docker-compose --env-file constants.env up -d
```
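Once the stack is up, you can confirm that all containers are running (a quick check; the exact container names depend on the provided compose files):
```
sudo docker ps --format "table {{.Names}}\t{{.Status}}"
```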
1. Verify web access by browsing to the server's IP address on port 8080:
```
http://<server_ip_here>:8080
```
## Security
To ensure the server's security, it's crucial to open only the necessary ports.
### Working with HTTP
Directly accessing the server using `HTTP` is not recommended. However, if you choose to do so, only the following ports
should be open to any location where a ClearML client (`clearml-agent`, SDK, or web browser) may operate:
* Port 8080 for accessing the WebApp
* Port 8008 for accessing the API server
* Port 8081 for accessing the file server
### Working with TLS / HTTPS
TLS termination through an external mechanism, such as a load balancer, is supported and recommended. For such a setup,
the following subdomains should be forwarded to the corresponding ports on the server:
* `https://api.<domain>` should be forwarded to port 8008
* `https://app.<domain>` should be forwarded to port 8080
* `https://files.<domain>` should be forwarded to port 8081
**Critical: Ensure no other ports are open to maintain the highest level of security.**
Additionally, ensure that the following URLs are correctly configured in the server's environment file:
```
WEBSERVER_URL_FOR_EXTERNAL_WORKERS=https://app.<your-domain>
APISERVER_URL_FOR_EXTERNAL_WORKERS=https://api.<your-domain>
FILESERVER_URL_FOR_EXTERNAL_WORKERS=https://files.<your-domain>
```
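Once DNS and TLS termination are in place, a quick way to confirm the forwarding works from a client machine (a minimal check; replace the domain with yours):
```
curl -I https://app.<your-domain>     # WebApp, expect HTTP 200
curl -I https://api.<your-domain>     # API server, expect an HTTP response
curl -I https://files.<your-domain>   # File server, expect an HTTP response
```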
:::note
If you prefer to use URLs that do not begin with `app`, `api`, or `files`, you must also add the following configuration
for the web server in your `docker-compose.override.yml` file:
```
webserver:
  environment:
    - WEBSERVER__displayedServerUrls={"apiServer":"$APISERVER_URL_FOR_EXTERNAL_WORKERS","filesServer":"$FILESERVER_URL_FOR_EXTERNAL_WORKERS"}
```
:::
## Backups
The main components that contain data are:
* MongoDB
* ElasticSearch
* File server
It is recommended to back them up periodically.
### Fileserver
It is recommended to back up the entire file server volume.
* Perform at least a daily backup.
* Keep a backup retention of at least 2 days.
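
For example, a simple file-level backup can be taken with `rsync` (a minimal sketch; the destination path is only a placeholder for your backup target):
```
# Copy the fileserver data to a backup location (placeholder path)
sudo rsync -a /opt/allegro/data/fileserver/ /backup/clearml-fileserver/
```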
### ElasticSearch
Please refer to [ElasticSearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html) for creating snapshots.
### MongoDB
Please refer to the [MongoDB documentation](https://www.mongodb.com/docs/manual/core/backups/) for backing up and restoring.
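
For example, a logical dump can be taken from the running MongoDB container (a minimal sketch; the container name and backup path are placeholders, verify the actual name with `sudo docker ps`):
```
# Dump all databases from the MongoDB container to an archive file (placeholder names)
sudo docker exec <mongo-container-name> mongodump --archive > /backup/clearml-mongo-$(date +%F).archive
```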
## Monitoring
The following monitoring is recommended:
### Basic Hardware Monitoring
#### CPU
CPU usage varies depending on system usage. We recommend monitoring CPU usage and alerting when it is higher than
normal. Recommended starting alert thresholds are 5-minute load averages of 5 and 10, adjusted according to observed
performance.
#### RAM
Available memory also varies depending on system usage. Due to spikes in usage when performing certain tasks, 6-8 GB
of available RAM is recommended as the standard baseline; some use cases may require more. We therefore recommend keeping
8 GB of available memory on top of regular system usage, and alerting when available memory drops below normal.
#### Disk Usage
There are several disks used by the system. We recommend monitoring all of them. Standard alert levels are 20%, 10% and
5% of free disk space.
### Service Availability
The following services should be monitored periodically for availability and for response time:
* `apiserver` - [http://localhost:10000/api/debug.ping](http://localhost:10000/api/debug.ping) should return HTTP 200
* `webserver` - [http://localhost:10000](http://localhost:10000/) should return HTTP 200
* `fileserver` - [http://localhost:10000/files/](http://localhost:10000/files/) should return HTTP 405 ("method not allowed")
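These checks can be scripted for periodic monitoring. Below is a minimal sketch, assuming it runs on the server itself and uses the endpoints listed above:
```
#!/bin/bash
# Minimal availability probe for the endpoints listed above
check() {
  local url="$1" expected="$2"
  local code
  code=$(curl -s -o /dev/null -w "%{http_code}" "$url")
  [ "$code" = "$expected" ] || echo "ALERT: $url returned $code (expected $expected)"
}
check "http://localhost:10000/api/debug.ping" 200
check "http://localhost:10000" 200
check "http://localhost:10000/files/" 405
```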
### API Server Docker Memory Usage
Usage spikes can happen during normal operation, but very high spikes (above 6 GB) are not expected. We recommend using
`docker stats` to get this information.
For example, the following command retrieves the API server's stats from the Docker daemon:
```
sudo curl -s --unix-socket /var/run/docker.sock http://localhost/containers/allegro-apiserver/stats?stream=false
```
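Alternatively, the same information can be read directly with `docker stats`, using the container name from the command above:
```
sudo docker stats --no-stream allegro-apiserver
```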
We recommend monitoring the API server's memory usage in addition to the system's available RAM. Alerts should be triggered
when the API server's memory usage exceeds its normal level; 6 GB is a reasonable starting threshold.
### Backup Failures
It is also highly recommended to monitor the backups and to alert if a backup has failed.
## Troubleshooting
In normal operation mode, all services should be up, and a call to `sudo docker ps` should yield the list of services.
If a service fails, it is usually due to one of the following:
* Lack of required resources such as storage or memory
* Incorrect configuration
* Software anomaly
When a service fails, it should automatically restart. However, if the cause of the failure is persistent, the service
will fail again. If a service fails, do the following:
### Check the Log
Run:
```
sudo docker logs -n 1000 <container name or ID>
```
See if there is an error message in the log that can explain the failure.
### Check the Server's Environment
The system should be constantly monitored; when troubleshooting, it is important to check the following:
* **Storage space**: Run `df -h` to check free disk space.
* **RAM**:
  * Run `vmstat -s` to check available RAM.
  * Run `top` to check the processes.
:::note
Some operations, such as complex queries, may cause a spike in memory usage. Therefore, it is recommended to have at least 8GB of free RAM available.
:::
* **Network**: Make sure that there is external access to the services.
* **CPU**: The best indicator of the need for additional compute resources is high CPU usage of the `apiserver` and `apiserver-es` services.
  * Examine the usage of each service using `sudo docker stats`.
  * If additional CPUs are added to the server, increase the number of workers on the `apiserver` service by changing
  the value of `APISERVER_WORKERS_NUMBER` in the `constants.env` file (up to one additional worker per additional core).
### API Server
If the `allegro-apiserver` container fails, or if the web application shows unexpected errors and the browser's
developer tools (F12) network tab shows error codes returned by the server, also check the `apiserver` log, which is
written to `/opt/allegro/logs/apiserver/apiserver.log`.
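For example, to view the most recent entries of that log:
```
sudo tail -n 200 /opt/allegro/logs/apiserver/apiserver.log
```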
Additionally, you can check the server availability using:
```
curl http://localhost:8008/api/debug.ping
```
This should return HTTP 200.
### Web Server
Check the webserver availability by running the following:
```
curl http://<server_ip_here>:8080/configuration.json
```