Update Enterprise Server deployment on K8s

This commit is contained in:
revital 2025-05-13 08:31:25 +03:00
parent c01766f852
commit c3b4224a6f

title: Kubernetes
---
This guide provides step-by-step instructions for installing the ClearML Enterprise control-plane in a Kubernetes cluster. ClearML Enterprise is the main ClearML Server, comprising the ClearML `apiserver`, `fileserver`, and `webserver` components.
The package also includes MongoDB, ElasticSearch, and Redis as Helm dependencies.
## Prerequisites
To deploy ClearML Enterprise, ensure the following components and configurations are in place:
- Kubernetes Cluster: A vanilla Kubernetes cluster is preferred for optimal GPU support.
- CLI Tools: `kubectl` and `helm` must be installed and configured.
- Ingress Controller: An Ingress controller (e.g. `nginx-ingress`) is required. If exposing services externally, a
  LoadBalancer-capable solution (e.g. `MetalLB`) should also be configured.
- Networking: The server and workers communicate over HTTP/S (ports 80 and 443). Additionally, the TCP session feature requires a
  range of ports for TCP traffic based on your configuration (see [AI App Gateway installation](appgw_install_k8s.md)).
- DNS Configuration: A domain with subdomain support is required, ideally with trusted TLS certificates. All entries must
  be resolvable by the Ingress controller. Example subdomains:
  - Control Plane:
    - `api.<BASE_DOMAIN>`
    - `app.<BASE_DOMAIN>`
    - `files.<BASE_DOMAIN>`
  - Worker:
    - `router.<BASE_DOMAIN>`
    - `tcp-router.<BASE_DOMAIN>` (optional, for TCP sessions)
- Storage: A configured StorageClass and an accessible storage backend.
- ClearML Enterprise Access:
  - Helm repository credentials (`<HELM_REPO_TOKEN>`)
  - DockerHub registry credentials (`<CLEARML_DOCKERHUB_TOKEN>`)
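To sanity-check the DNS prerequisite, the record names the Ingress controller must serve can be enumerated with a short shell loop. This is a sketch; `example.com` stands in for your real `<BASE_DOMAIN>`:

```shell
# Stand-in base domain; replace with your real one
BASE_DOMAIN="example.com"

# Print every record that must resolve to the Ingress controller
for sub in api app files router tcp-router; do
  echo "$sub.$BASE_DOMAIN"
done
```

Each printed name should then resolve from inside and outside the cluster (e.g. `nslookup "api.$BASE_DOMAIN"`).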
### Recommended Cluster Specifications
For optimal performance, a Kubernetes cluster with at least 3 nodes is recommended, each having:
- 8 vCPUs
- 32 GB RAM
- 500 GB storage
## Installation
### Add the Helm Repo Locally
Add the ClearML Helm repository:
``` bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
```
Update the repository locally:
``` bash
helm repo update
```
### Prepare Values
Create a `clearml-values.override.yaml` file with the following content:
:::note
In the following configuration, replace the `<BASE_DOMAIN>` placeholders with a valid domain that will have records
pointing to the cluster's Ingress Controller. This will be the base domain for reaching your ClearML installation.
:::
```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
  cookieDomain: "<BASE_DOMAIN>"
apiserver:
  ingress:
    enabled: true
    hostName: "api.<BASE_DOMAIN>"
  service:
    type: ClusterIP
fileserver:
  ingress:
    enabled: true
    hostName: "files.<BASE_DOMAIN>"
  service:
    type: ClusterIP
webserver:
  ingress:
    enabled: true
    hostName: "app.<BASE_DOMAIN>"
  service:
    type: ClusterIP
clearmlApplications:
  enabled: true
```
### Install the Chart
Install the ClearML Enterprise Helm chart using the previous values override file.
``` bash
helm upgrade -i -n clearml clearml clearml-enterprise/clearml-enterprise --create-namespace -f clearml-values.override.yaml
```
## Additional Configuration Options
:::note
You can view the full set of available and documented values of the chart by running the following command:
```bash
helm show readme clearml-enterprise/clearml-enterprise
# or
helm show values clearml-enterprise/clearml-enterprise
```
:::
### Default Secret Values
For improved security, all internal credentials are auto-generated randomly and stored in a Kubernetes Secret.
If you need to define your own credentials to be used instead, replace the default key and secret values in `clearml-values.override.yaml`.
``` yaml
clearml:
# Replace the following values to use custom internal credentials.
apiserverKey: ""
apiserverSecret: ""
fileserverKey: ""
fileserverSecret: ""
secureAuthTokenSecret: ""
testUserKey: ""
testUserSecret: ""
```
In a shell, if `openssl` is installed, you can use this simple command to generate random strings suitable as keys and secrets:
``` bash
openssl rand -hex 16
```
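For example, a short loop along these lines (the loop itself is illustrative, not part of the chart) generates all seven values in one pass, ready to paste under the `clearml:` key:

```shell
# Emit one "key: value" line per credential, each a 32-character hex string
for name in apiserverKey apiserverSecret fileserverKey fileserverSecret \
            secureAuthTokenSecret testUserKey testUserSecret; do
  echo "$name: \"$(openssl rand -hex 16)\""
done
```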
### Fixed Users
Enable and configure simple login with a username and password in `clearml-values.override.yaml`. This is useful for simple PoC
installations, and is an optional step in case SSO (identity provider) configuration is not performed.
Note that this setup is not ideal for multi-tenant setups, as fixed users will only be associated with the default tenant.
For example, to define a fixed user:
``` yaml
apiserver:
  additionalConfigs:
    apiserver.conf: |
      auth {
        # Fixed users login, used instead of SSO
        fixed_users {
          enabled: true
          pass_hashed: false
          users: [
            {
              username: "john"
              password: "12345678"
              name: "John Doe"
            }
          ]
        }
      }
```
### SSO (Identity Provider)
The following examples (Auth0 and Keycloak) show how to configure an identity provider on the ClearML server.
Add the following values configuring `extraEnvs` for the `apiserver` in the `clearml-values.override.yaml` file.
Substitute all `<PLACEHOLDER>`s with the correct values for your configuration.
#### Auth0 Identity Provider
```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__secure__login__sso__oauth_client__auth0__client_id
      value: "<SSO_CLIENT_ID>"
    - name: CLEARML__secure__login__sso__oauth_client__auth0__client_secret
      value: "<SSO_CLIENT_SECRET>"
    - name: CLEARML__services__login__sso__oauth_client__auth0__base_url
      value: "<SSO_CLIENT_URL>"
    - name: CLEARML__services__login__sso__oauth_client__auth0__authorize_url
      value: "<SSO_CLIENT_AUTHORIZE_URL>"
    - name: CLEARML__services__login__sso__oauth_client__auth0__access_token_url
      value: "<SSO_CLIENT_ACCESS_TOKEN_URL>"
    - name: CLEARML__services__login__sso__oauth_client__auth0__audience
      value: "<SSO_CLIENT_AUDIENCE>"
```
#### Keycloak Identity Provider
```yaml
apiserver:
  extraEnvs:
    - name: CLEARML__secure__login__sso__oauth_client__keycloak__client_id
      value: "<KC_CLIENT_ID>"
    - name: CLEARML__secure__login__sso__oauth_client__keycloak__client_secret
      value: "<KC_SECRET_ID>"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__base_url
      value: "<KC_URL>/realms/<REALM_NAME>/"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__authorize_url
      value: "<KC_URL>/realms/<REALM_NAME>/protocol/openid-connect/auth"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__access_token_url
      value: "<KC_URL>/realms/<REALM_NAME>/protocol/openid-connect/token"
    - name: CLEARML__services__login__sso__oauth_client__keycloak__idp_logout
      value: "true"
```
## Install ClearML Agent Chart
### Configuration
To configure the agent, choose a Redis password and use it when setting up Redis as well
(see [Shared Redis installation](multi_tenant_k8s.md#shared-redis-installation)).
The Helm Chart must be installed with `overrides.yaml`:
```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
  agentk8sglueKey: "<ACCESS_KEY>"
  agentk8sglueSecret: "<SECRET_KEY>"
agentk8sglue:
  apiServerUrlReference: "https://api.<BASE_DOMAIN>"
  fileServerUrlReference: "https://files.<BASE_DOMAIN>"
  webServerUrlReference: "https://app.<BASE_DOMAIN>"
  defaultContainerImage: "python:3.9"
```
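Queues served by the agent can also be declared in the same override file. The following is a minimal sketch only, assuming the `agentk8sglue.queues`/`templateOverrides` structure used in the shared-memory example later in this guide; the queue name and the GPU resource limit are illustrative, not defaults:

```yaml
agentk8sglue:
  queues:
    default:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
```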
### Installing the Chart
```bash
helm install -n <WORKLOAD_NAMESPACE> \
clearml-agent \
clearml-enterprise/clearml-enterprise-agent \
--create-namespace \
-f overrides.yaml
```
To create a queue via the API:
```bash
curl $APISERVER_URL/queues.create \
-H "Content-Type: application/json" \
-H "X-Clearml-Impersonate-As:<USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"name":"default"}'
```
## ClearML AI Application Gateway Installation
### Configuring the Chart
The Helm Chart must be installed with `overrides.yaml`:
```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
clearml:
  apiServerKey: ""
  apiServerSecret: ""
  apiServerUrlReference: "https://api.<BASE_DOMAIN>"
  authCookieName: ""
ingress:
  enabled: true
  hostName: "router.<BASE_DOMAIN>"
tcpSession:
  routerAddress: "<NODE_IP OR EXTERNAL_NAME>"
  portRange:
    start: <START_PORT>
    end: <END_PORT>
```
**Configuration options:**
* **`clearml.apiServerUrlReference`:** ClearML API server URL, usually starting with `https://api.`
* **`clearml.apiServerKey`:** ClearML server API key
* **`clearml.apiServerSecret`:** ClearML server secret key
* **`ingress.hostName`:** Hostname of the router configured previously for the load balancer
* **`clearml.sslVerify`:** Enable or disable SSL certificate validation on API server calls
* **`clearml.authCookieName`:** Value of the `value_prefix` key starting with `allegro_token` in the `envoy.yaml` file of the ClearML server installation
* **`tcpSession.routerAddress`**: External address of the router; can be an IP, the host machine's hostname, or a load balancer hostname, depending on the network configuration
* **`tcpSession.portRange.start`**: Start port for the TCP session feature
* **`tcpSession.portRange.end`**: End port for the TCP session feature
### Installing the Chart
```bash
helm install -n <WORKLOAD_NAMESPACE> \
clearml-ttr \
clearml-enterprise/clearml-enterprise-task-traffic-router \
--create-namespace \
-f overrides.yaml
```
## Applications Installation
To install the ClearML Applications on the newly installed ClearML Enterprise control-plane, download the applications
package using the URL provided by the ClearML staff.
### Download and Extract
```bash
wget -O apps.zip "<ClearML enterprise applications configuration download url>"
unzip apps.zip
```
### Adjust Application Docker Images Location (Air-Gapped Systems)
ClearML applications use pre-built docker images provided by ClearML on the ClearML DockerHub
repository. If you are using an air-gapped system, these images must be available as part of your internal docker
registry, and the correct docker images location must be specified before installing the applications.
Use the following script to adjust the applications packages accordingly before installing the applications:
```bash
python convert_image_registry.py \
--apps-dir /path/to/apps/ \
--repo local_registry/clearml-apps
```
The script will change the application zip files to point to the new registry, and will output the list of containers
that need to be copied to the local registry. For example:
```
make sure allegroai/clearml-apps:hpo-1.10.0-1062 was added to local_registry/clearml-apps
```
### Install Applications
Use the `upload_apps.py` script to upload the application packages to the ClearML server:
```bash
python upload_apps.py \
--host $APISERVER_ADDRESS \
--user $APISERVER_USER --password $APISERVER_PASSWORD \
--dir apps -ml
```
## Configuring Shared Memory for Large Model Deployment
Deploying large models may fail due to shared memory size limitations. This issue commonly arises when the allocated
`/dev/shm` space is insufficient, resulting in NCCL errors such as:
```
> 3d3e22c3066f:168:168 [0] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-UbzKZ9 to 9637892 bytes
> 3d3e22c3066f:168:168 [0] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-UbzKZ9 (size 9637888)
> 3d3e22c3066f:168:168 [0] NCCL INFO transport/shm.cc:114 -> 2
> 3d3e22c3066f:168:168 [0] NCCL INFO transport.cc:33 -> 2
> 3d3e22c3066f:168:168 [0] NCCL INFO transport.cc:113 -> 2
> 3d3e22c3066f:168:168 [0] NCCL INFO init.cc:1263 -> 2
> 3d3e22c3066f:168:168 [0] NCCL INFO init.cc:1548 -> 2
> 3d3e22c3066f:168:168 [0] NCCL INFO init.cc:1799 -> 2
```
To configure a proper SHM size, use the following configuration in the agent `overrides.yaml`.
Replace `<SIZE>` with the desired memory allocation in GiB, based on your model requirements.
This example configures a specific queue, but you can include this setting in the `basePodTemplate` if you need to
apply it to all tasks.
```yaml
agentk8sglue:
queues:
GPUshm:
templateOverrides:
env:
- name: VLLM_SKIP_P2P_CHECK
value: "1"
volumeMounts:
- name: dshm
mountPath: /dev/shm
volumes:
- name: dshm
emptyDir:
medium: Memory
sizeLimit: <SIZE>Gi
```

## Next Steps
Once the ClearML Enterprise control-plane is up and running, proceed with installing the ClearML Enterprise Agent and
[AI App Gateway](appgw_install_k8s.md).