diff --git a/docs/deploying_clearml/enterprise_deploy/k8s.md b/docs/deploying_clearml/enterprise_deploy/k8s.md index d4052918..c5385fdd 100644 --- a/docs/deploying_clearml/enterprise_deploy/k8s.md +++ b/docs/deploying_clearml/enterprise_deploy/k8s.md @@ -2,408 +2,168 @@ title: Kubernetes --- +This guide provides step-by-step instructions for installing the ClearML Enterprise control-plane setup in a Kubernetes cluster. -This guide provides step-by-step instructions for installing the ClearML Enterprise setup in a Kubernetes cluster. +ClearML Enterprise is the main ClearML Server, comprising the ClearML `apiserver`, `fileserver`, and `webserver` components. +The package also includes MongoDB, ElasticSearch, and Redis as Helm dependencies. ## Prerequisites +To deploy ClearML Enterprise, ensure the following components and configurations are in place: -* A Kubernetes cluster -* An ingress controller (e.g. `nginx-ingress`) and the ability to create LoadBalancer services (e.g. MetalLB) if needed - to expose ClearML -* Credentials for ClearML Enterprise GitHub Helm chart repository -* Credentials for ClearML Enterprise DockerHub repository -* URL for downloading the ClearML Enterprise applications configuration +- Kubernetes Cluster: A vanilla Kubernetes cluster is preferred for optimal GPU support. +- CLI Tools: `kubectl` and `helm` must be installed and configured. +- Ingress Controller: An Ingress controller (e.g., `nginx-ingress`) is required. If exposing services externally, a + LoadBalancer-capable solution (e.g. `MetalLB`) should also be configured. +- Server and workers that communicate on HTTP/S (ports 80 and 443). Additionally, the TCP session feature requires a + range of ports for TCP traffic based on your configuration (see [AI App Gateway installation](appgw_install_k8s.md)). +- DNS Configuration: A domain with subdomain support is required, ideally with trusted TLS certificates. All entries must + be resolvable by the Ingress controller. 
Example subdomains: + - Control Plane: + - `api.` + - `app.` + - `files.` + - Worker: + - `router.` + - `tcp-router.` (optional, for TCP sessions) +- Storage: A configured StorageClass and an accessible storage backend. +- ClearML Enterprise Access: + - Helm repository credentials (``) + - DockerHub registry credentials (``) +### Recommended Cluster Specifications -## Control Plane Installation +For optimal performance, a Kubernetes cluster with at least 3 nodes is recommended, each having: +- 8 vCPUs +- 32 GB RAM +- 500 GB storage -The following steps cover installing the control plane (server and required charts) and will -require some or all of the tokens/deliverables mentioned above. +## Installation +### Add the Helm Repo Locally -### Requirements +Add the ClearML Helm repository: +``` bash +helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username --password +``` -* Add the ClearML Enterprise repository: +Update the repository locally: +``` bash +helm repo update +``` +### Prepare Values - ``` - helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username --password - ``` - - -* Update the repository locally: - - - ``` - helm repo update - ``` - - -### Install ClearML Enterprise Chart - - -#### Configuration - - -The Helm Chart must be installed with an `overrides.yaml` overriding values as follows: - +Create a `clearml-values.override.yaml` file with the following content: :::note -In the following configuration, replace `` with a valid domain -that will have records pointing to the cluster’s ingress controller (see ingress details in the values below). +In the following configuration, replace the `` placeholders with a valid domain that will have records +pointing to the cluster's Ingress Controller. This will be the base domain for reaching your ClearML installation. 
::: - -```yaml +``` yaml imageCredentials: - password: "" - + password: "" clearml: cookieDomain: "" - # Set values for improved security - apiserverKey: "" - apiserverSecret: "" - fileserverKey: "" - fileserverSecret: "" - secureAuthTokenSecret: "" - testUserKey: "" - testUserSecret: "" - apiserver: ingress: enabled: true hostName: "api." service: type: ClusterIP - fileserver: ingress: enabled: true - hostName: "file." + hostName: "files." service: type: ClusterIP - webserver: ingress: enabled: true hostName: "app." service: type: ClusterIP - clearmlApplications: enabled: true ``` -#### Additional Configuration Options -##### Fixed Users (Simple Login) +### Install the Chart +Install the ClearML Enterprise Helm chart using the previous values override file. -Enable static login with username and password in `overrides.yaml`. - - -This is an optional step in case SSO (Identity provider) configuration will not be performed. - - -``` -apiserver: - additionalConfigs: - apiserver.conf: | - auth { - fixed_users { - enabled: true - pass_hashed: false - users: [ - { - username: "my_user" - password: "my_password" - name: "My User" - admin: true - }, - ] - } - } +``` bash +helm upgrade -i -n clearml clearml clearml-enterprise/clearml-enterprise --create-namespace -f clearml-values.override.yaml ``` +## Additional Configuration Options -##### SSO (Identity Provider) +:::note +You can view the full set of available and documented values of the chart by running the following command: - -The following examples (Auth0 and Keycloak) show how to configure an identity provider on the ClearML server. - - -Add the following values configuring `extraEnvs` for `apiserver` in the `clearml-enterprise` values `override.yaml` file. - - -Substitute all ``s with the correct value for your configuration. 
- - -##### Auth0 Identity Provider - - -```yaml -apiserver: - extraEnvs: - - name: CLEARML__secure__login__sso__oauth_client__auth0__client_id - value: "" - - name: CLEARML__secure__login__sso__oauth_client__auth0__client_secret - value: "" - - name: CLEARML__services__login__sso__oauth_client__auth0__base_url - value: "" - - name: CLEARML__services__login__sso__oauth_client__auth0__authorize_url - value: "" - - name: CLEARML__services__login__sso__oauth_client__auth0__access_token_url - value: "" - - name: CLEARML__services__login__sso__oauth_client__auth0__audience - value: "" +```bash +helm show readme clearml-enterprise/clearml-enterprise +# or +helm show values clearml-enterprise/clearml-enterprise ``` +::: +### Default Secret Values -##### Keycloak Identity Provider +For improved security, all the internal credentials are auto-generated randomly and stored in a Secret in +Kubernetes. +If you need to define your own credentials to be used instead, replace the default key and secret values in `clearml-values.override.yaml`. 
-```yaml -apiserver: - extraEnvs: - - name: CLEARML__secure__login__sso__oauth_client__keycloak__client_id - value: "" - - name: CLEARML__secure__login__sso__oauth_client__keycloak__client_secret - value: "" - - name: CLEARML__services__login__sso__oauth_client__keycloak__base_url - value: "/realms//" - - name: CLEARML__services__login__sso__oauth_client__keycloak__authorize_url - value: "/realms//protocol/openid-connect/auth" - - name: CLEARML__services__login__sso__oauth_client__keycloak__access_token_url - value: "/realms//protocol/openid-connect/token" - - name: CLEARML__services__login__sso__oauth_client__keycloak__idp_logout - value: "true" -``` - - -#### Installing the Chart - - -``` -helm install -n clearml \ - clearml \ - clearml-enterprise/clearml-enterprise \ - --create-namespace \ - -f overrides.yaml -``` - - -### Install ClearML Agent Chart - - -#### Configuration - - -To configure the agent you will need to choose a Redis password and use that when setting up Redis as well -(see [Shared Redis installation](multi_tenant_k8s.md#shared-redis-installation)). - - -The Helm Chart must be installed with `overrides.yaml`: - - -```yaml -imageCredentials: - password: "" +``` yaml clearml: - agentk8sglueKey: "" - agentk8sglueSecret: "" -agentk8sglue: - apiServerUrlReference: "https://api." - fileServerUrlReference: "https://files." - webServerUrlReference: "https://app." - defaultContainerImage: "python:3.9" + # Replace the following values to use custom internal credentials. 
+ apiserverKey: "" + apiserverSecret: "" + fileserverKey: "" + fileserverSecret: "" + secureAuthTokenSecret: "" + testUserKey: "" + testUserSecret: "" ``` +In a shell, if `openssl` is installed, you can use this simple command to generate random strings suitable as keys and secrets: -#### Installing the Chart - - -```bash -helm install -n \ - clearml-agent \ - clearml-enterprise/clearml-enterprise-agent \ - --create-namespace \ - -f overrides.yaml +``` bash +openssl rand -hex 16 ``` +### Fixed Users -To create a queue by API: +Enable and configure simple login with username and password in `clearml-values.override.yaml`. This is useful for simple PoC +installations. This is an optional step in case the SSO (Identity provider) configuration is not performed. +Please note that this setup is not ideal for multi-tenant setups as fixed users will only be associated with the default tenant. -```bash -curl $APISERVER_URL/queues.create \ --H "Content-Type: application/json" \ --H "X-Clearml-Impersonate-As:" \ --u $APISERVER_KEY:$APISERVER_SECRET \ --d '{"name":"default"}' +``` yaml +apiserver: + additionalConfigs: + apiserver.conf: | + auth { + fixed_users { + enabled: true + pass_hashed: false + users: [ + { + username: "my_user" + password: "my_password" + name: "My User" + admin: true + }, + ] + } + } ``` +## Next Steps -## ClearML AI Application Gateway Installation - - -### Configuring Chart - - -The Helm Chart must be installed with `overrides.yaml`: - - -```yaml -imageCredentials: - password: "" -clearml: - apiServerKey: "" - apiServerSecret: "" - apiServerUrlReference: "https://api." 
- authCookieName: "" -ingress: - enabled: true - hostName: "task-router.dev" -tcpSession: - routerAddress: "" - portRange: - start: - end: -``` - - -**Configuration options:** - - -* **`clearml.apiServerUrlReference`:** URL usually starting with `https://api.` -* **`clearml.apiServerKey`:** ClearML server API key -* **`clearml.apiServerSecret`:** ClearML server secret key -* **`ingress.hostName`:** URL of the router we configured previously for load balancer starting with `https://` -* **`clearml.sslVerify`:** Enable or disable SSL certificate validation on apiserver calls check -* **`clearml.authCookieName`:** Value from `value_prefix` key starting with `allegro_token` in `envoy.yaml` file in ClearML server installation. -* **`tcpSession.routerAddress`**: Router external address can be an IP or the host machine or a load balancer hostname, depends on the network configuration -* **`tcpSession.portRange.start`**: Start port for the TCP Session feature -* **`tcpSession.portRange.end`**: End port for the TCP Session feature - - -### Installing the Chart - - -```bash -helm install -n \ - clearml-ttr \ - clearml-enterprise/clearml-enterprise-task-traffic-router \ - --create-namespace \ - -f overrides.yaml -``` - - - - -## Applications Installation - - -To install the ClearML Applications on the newly installed ClearML Enterprise control-plane, download the applications -package using the URL provided by the ClearML staff. - - - - -### Download and Extract - - -``` -wget -O apps.zip "" -unzip apps.zip -``` - - -### Adjust Application Docker Images Location (Air-Gapped Systems) - - -ClearML applications use pre-built docker images provided by ClearML on the ClearML DockerHub -repository. If you are using an air-gapped system, these images must be available as part of your internal docker -registry, and the correct docker images location must be specified before installing the applications. 
- - -Use the following script to adjust the applications packages accordingly before installing the applications: - - -``` -python convert_image_registry.py \ - --apps-dir /path/to/apps/ \ - --repo local_registry/clearml-apps -``` - - -The script will change the application zip files to point to the new registry, and will output the list of containers -that need to be copied to the local registry. For example: - - -``` -make sure allegroai/clearml-apps:hpo-1.10.0-1062 was added to local_registry/clearml-apps -``` - - -### Install Applications - - -Use the `upload_apps.py` script to upload the application packages to the ClearML server: - - -``` -python upload_apps.py \ - --host $APISERVER_ADDRESS \ - --user $APISERVER_USER --password $APISERVER_PASSWORD \ - --dir apps -ml -``` - - -## Configuring Shared Memory for Large Model Deployment - - -Deploying large models may fail due to shared memory size limitations. This issue commonly arises when the allocated -`/dev/shm` space is insufficient.: - - -``` -> 3d3e22c3066f:168:168 [0] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-UbzKZ9 to 9637892 bytes -> 3d3e22c3066f:168:168 [0] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-UbzKZ9 (size 9637888) -> 3d3e22c3066f:168:168 [0] NCCL INFO transport/shm.cc:114 -> 2 -> 3d3e22c3066f:168:168 [0] NCCL INFO transport.cc:33 -> 2 -> 3d3e22c3066f:168:168 [0] NCCL INFO transport.cc:113 -> 2 -> 3d3e22c3066f:168:168 [0] NCCL INFO init.cc:1263 -> 2 -> 3d3e22c3066f:168:168 [0] NCCL INFO init.cc:1548 -> 2 -> 3d3e22c3066f:168:168 [0] NCCL INFO init.cc:1799 -> 2 -``` - - -To configure a proper SHM size you can use the following configuration in the agent `overrides.yaml`. - - -Replace `` with the desired memory allocation in GiB, based on your model requirements. - - -This example configures a specific queue, but you can include this setting in the `basePodTemplate` if you need to -apply it to all tasks. 
- - -```yaml -agentk8sglue: - queues: - GPUshm: - templateOverrides: - env: - - name: VLLM_SKIP_P2P_CHECK - value: "1" - volumeMounts: - - name: dshm - mountPath: /dev/shm - volumes: - - name: dshm - emptyDir: - medium: Memory - sizeLimit: Gi -``` +Once the ClearML Enterprise control-plane is up and running, proceed with installing the ClearML Enterprise Agent and +[AI App Gateway](appgw_install_k8s.md).
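
For the custom credentials described under Default Secret Values, the `openssl rand -hex 16` command can be wrapped in a small script that prints a ready-to-paste `clearml:` fragment. This is a minimal sketch, not part of the chart: the key names are taken from the override example in this guide, while the function names `random_hex` and `generate_clearml_secrets` and the `/dev/urandom` fallback are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch: print a `clearml:` values fragment with freshly generated credentials.
# Key names come from the override example in this guide; the helper function
# names are hypothetical and are not provided by the chart or by ClearML.

# Emit 32 hex characters (16 random bytes).
random_hex() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex 16
  else
    # Fallback (an assumption, not from the guide): read 16 bytes from /dev/urandom.
    od -An -tx1 -N16 /dev/urandom | tr -d ' \n'
    echo
  fi
}

# Print the YAML fragment to stdout, one generated value per key.
generate_clearml_secrets() {
  local key
  echo "clearml:"
  for key in apiserverKey apiserverSecret fileserverKey fileserverSecret \
             secureAuthTokenSecret testUserKey testUserSecret; do
    printf '  %s: "%s"\n' "$key" "$(random_hex)"
  done
}

generate_clearml_secrets
```

The output can be copied into the `clearml:` section of `clearml-values.override.yaml` before running the `helm upgrade -i` command. Each run produces new values, so store the generated fragment somewhere safe if the same credentials must be reused.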