Update Enterprise Server deployment on K8s

2025-06-26 18:17:44 +00:00 · 2025-05-15 12:59:28 +03:00 · 2025-05-15 12:59:28 +03:00 · d0a8cc4448
commit d0a8cc4448
parent f5e3d4e5f0 cb19989308
2 changed files with 121 additions and 371 deletions
--- a/docs/deploying_clearml/enterprise_deploy/appgw_install_k8s.md
+++ b/docs/deploying_clearml/enterprise_deploy/appgw_install_k8s.md
@@ -6,13 +6,13 @@ title: Kubernetes Deployment
 The AI Application Gateway is available under the ClearML Enterprise plan.
 :::

-This guide details the installation of the ClearML App Gateway Router.
-The App Gateway Router enables access to your AI workload applications (e.g. remote IDEs like VSCode and Jupyter, model API interface, etc.).
+This guide details the installation of the ClearML App Gateway.
+The App Gateway enables access to your AI workload applications (e.g. remote IDEs like VSCode and Jupyter, model API interface, etc.).
 It acts as a proxy, identifying ClearML Tasks running within its [K8s namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) 
 and making them available for network access.

 :::important 
-The App Gateway Router must be installed in the same K8s namespace as a dedicated ClearML Agent.
+The App Gateway must be installed in the same K8s namespace as a dedicated ClearML Agent.
 It can only configure access for ClearML Tasks within its own namespace.
 :::

@@ -27,35 +27,31 @@ It can only configure access for ClearML Tasks within its own namespace.

 ## Optional for HTTPS

-* A valid DNS entry for the new App Gateway Router instance  
+* A valid DNS entry for the new App Gateway instance  
 * A valid SSL certificate

 ## Helm

 ### Login

-```
-helm repo add clearml-enterprise \
-https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages \
--username <GITHUB_TOKEN> \
--password <GITHUB_TOKEN>
+``` bash
+helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <GITHUB_TOKEN> --password <GITHUB_TOKEN>
 ```

 Replace `<GITHUB_TOKEN>` with your valid GitHub token that has access to the ClearML Enterprise Helm charts repository.

 ### Prepare Values

-Before installing the App Gateway Router, create a Helm override file:
+Before installing the App Gateway, create a Helm override file named `clearml-app-gateway-values.override.yaml`:

-```
+```yaml
 imageCredentials:
  password: ""
 clearml:
-  apiServerKey: ""
-  apiServerSecret: ""
+  apiKey: ""
+  apiSecret: ""
  apiServerUrlReference: ""
  authCookieName: ""
-  sslVerify: true
 ingress:
  enabled: true
  hostName: ""
@@ -71,13 +67,12 @@ tcpSession:
 **Configuration options:**

 * `imageCredentials.password`: ClearML DockerHub Access Token.
-* `clearml.apiServerKey`: ClearML server API key.  
-* `clearml.apiServerSecret`: ClearML server secret key.  
+* `clearml.apiKey` and `clearml.apiSecret`: [API credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials) created in the ClearML web UI by an Admin user or Service 
+  Account with admin privileges. Make sure to label these credentials clearly, so that they will not be revoked by mistake.
 * `clearml.apiServerUrlReference`: ClearML API server URL starting with `https://api.`.  
 * `clearml.authCookieName`: Cookie used by the ClearML server to store the ClearML authentication cookie.
-* `clearml.sslVerify`: Enable or disable SSL certificate validation on `apiserver` calls check.  
-* `ingress.hostName`: Hostname of router used by the ingress controller to access it.  
-* `tcpSession.routerAddress`: The external router address (can be an IP, hostname, or load balancer address) depending on your network setup. Ensure this address is accessible for TCP connections.
+* `ingress.hostName`: Hostname that the ingress controller uses to reach the App Gateway.  
+* `tcpSession.routerAddress`: The external App Gateway address. This can be an IP, hostname, or load balancer address, depending on your network setup. Ensure this address is accessible for TCP connections.
 * `tcpSession.service.type`: Service type used to expose TCP functionality, default is `NodePort`.
 * `tcpSession.portRange.start`: Start port for the TCP Session feature.  
 * `tcpSession.portRange.end`: End port for the TCP Session feature.
@@ -85,33 +80,28 @@ tcpSession:

 The full list of supported configuration is available with the command:

-```
-helm show readme allegroai-enterprise/clearml-enterprise-task-traffic-router
+``` bash
+helm show readme clearml-enterprise/clearml-enterprise-app-gateway
 ```

 ### Install

-To install the App Gateway Router component via Helm use the following command:
+To install the App Gateway component via Helm, use the following command:

-```
-helm upgrade --install \
-<RELEASE_NAME> \
-n <WORKLOAD_NAMESPACE> \
-allegroai-enterprise/clearml-enterprise-task-traffic-router \
--version <CHART_VERSION> \
-f override.yaml
+``` bash
+helm upgrade --install <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
 ```

 Replace the placeholders with the following values:

-* `<RELEASE_NAME>` - Unique name for the App Gateway Router within the K8s namespace. This is a required parameter in 
-  Helm, which identifies a specific installation of the chart. The release name also defines the router’s name and 
+* `<RELEASE_NAME>` - Unique name for the App Gateway within the K8s namespace. This is a required parameter in 
+  Helm, which identifies a specific installation of the chart. The release name also defines the App Gateway's name and 
  appears in the UI within AI workload application URLs (e.g. Remote IDE URLs). This can be customized to support multiple installations within the same 
  namespace by assigning different release names.
 * `<WORKLOAD_NAMESPACE>` - [Kubernetes Namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) 
  where workloads will be executed. This namespace must be shared between a dedicated ClearML Agent and an App 
-  Gateway Router. The agent is responsible for monitoring its assigned task queues and spawning workloads within this 
-  namespace. The router monitors the same namespace for AI workloads (e.g. remote IDE applications). The router has a 
+  Gateway. The agent is responsible for monitoring its assigned task queues and spawning workloads within this 
+  namespace. The App Gateway monitors the same namespace for AI workloads (e.g. remote IDE applications). The App Gateway has a 
  namespace-limited scope, meaning it can only detect and manage tasks within its 
  assigned namespace.
 * `<CHART_VERSION>` - Version recommended by the ClearML Support Team.
--- a/docs/deploying_clearml/enterprise_deploy/k8s.md
+++ b/docs/deploying_clearml/enterprise_deploy/k8s.md
@@ -2,408 +2,168 @@
 title: Kubernetes
 ---

+This guide provides step-by-step instructions for installing the ClearML Enterprise Server (control-plane) in a Kubernetes cluster.

-This guide provides step-by-step instructions for installing the ClearML Enterprise setup in a Kubernetes cluster.
+The ClearML Enterprise Server includes the ClearML `apiserver`, `fileserver`, and `webserver` components. 
+The package also includes MongoDB, ElasticSearch, and Redis as Helm dependencies.


 ## Prerequisites

+To deploy a ClearML Server, ensure the following components and configurations are in place:

-* A Kubernetes cluster 
-* An ingress controller (e.g. `nginx-ingress`) and the ability to create LoadBalancer services (e.g. MetalLB) if needed
- to expose ClearML 
-* Credentials for ClearML Enterprise GitHub Helm chart repository
-* Credentials for ClearML Enterprise DockerHub repository 
-* URL for downloading the ClearML Enterprise applications configuration
+- Kubernetes Cluster: A standard Kubernetes cluster is preferred for optimal GPU support.
+- CLI Tools: `kubectl` and `helm` must be installed and configured.
+- Ingress Controller: An Ingress controller (e.g., `nginx-ingress`) is required. If exposing services externally, a 
+  LoadBalancer-capable solution (e.g. `MetalLB`) should also be configured.
+- Network: The server and workers must be able to communicate over HTTP/S (ports 80 and 443). Additionally, the TCP session feature requires a 
+  range of ports for TCP traffic based on your configuration (see [AI App Gateway installation](appgw_install_k8s.md)).
+- DNS Configuration: A domain with subdomain support is required, ideally with trusted TLS certificates. All entries must 
+  be resolvable by the Ingress controller. Example subdomains:
+  - Server:
+    - `api.<BASE_DOMAIN>`
+    - `app.<BASE_DOMAIN>`
+    - `files.<BASE_DOMAIN>`
+  - Worker:
+    - `router.<BASE_DOMAIN>`
+    - `tcp-router.<BASE_DOMAIN>` (optional, for TCP sessions)
+- Storage: A configured StorageClass and an accessible storage backend.
+- ClearML Enterprise Access:
+  - Helm repository credentials (`<HELM_REPO_TOKEN>`)
+  - DockerHub registry credentials (`<CLEARML_DOCKERHUB_TOKEN>`)

+### Recommended Cluster Specifications

-## Control Plane Installation
+For optimal performance, a Kubernetes cluster with at least 3 nodes is recommended, each having:

+- 8 vCPUs
+- 32 GB RAM
+- 500 GB storage

-The following steps cover installing the control plane (server and required charts) and will
-require some or all of the tokens/deliverables mentioned above.
+## Installation

+### Add the Helm Repo Locally

-### Requirements
+Add the ClearML Helm repository:

+``` bash
+helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
+```

-* Add the ClearML Enterprise repository:
+Update the repository locally:
+``` bash
+helm repo update
+```

+### Prepare Values

- ```
- helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <clearmlenterprise_GitHub_TOKEN> --password <clearmlenterprise_GitHub_TOKEN>
- ```
-
-
-* Update the repository locally:
-
-
- ```
- helm repo update
- ```
-
-
-### Install ClearML Enterprise Chart
-
-
-#### Configuration
-
-
-The Helm Chart must be installed with an `overrides.yaml` overriding values as follows:
-
+Create a `clearml-values.override.yaml` file with the following content:

 :::note
-In the following configuration, replace `<BASE_DOMAIN>` with a valid domain
-that will have records pointing to the cluster’s ingress controller (see ingress details in the values below).
+In the following configuration, replace the `<BASE_DOMAIN>` placeholders with a valid domain that will have records 
+pointing to the cluster's Ingress Controller. This will be the base domain for reaching your ClearML installation.
 :::

-
-```yaml
+``` yaml
 imageCredentials:
-  password: "<clearml_enterprise_DockerHub_TOKEN>"
-
+  password: "<CLEARML_DOCKERHUB_TOKEN>"
 clearml:
  cookieDomain: "<BASE_DOMAIN>"
-  # Set values for improved security
-  apiserverKey: "<GENERATED_API_SERVER_KEY>"
-  apiserverSecret: "<GENERATED_API_SERVER_SECRET>"
-  fileserverKey: "<GENERATED_FILE_SERVER_KEY>"
-  fileserverSecret: "<GENERATED_FILE_SERVER_SECRET>"
-  secureAuthTokenSecret: "<GENERATED_AUTH_TOKEN_SECRET>"
-  testUserKey: "<GENERATED_TEST_USER_KEY>"
-  testUserSecret: "<GENERATED_TEST_USER_SECRET>"
-
 apiserver:
  ingress:
    enabled: true
    hostName: "api.<BASE_DOMAIN>"
  service:
    type: ClusterIP
-
 fileserver:
  ingress:
    enabled: true
-    hostName: "file.<BASE_DOMAIN>"
+    hostName: "files.<BASE_DOMAIN>"
  service:
    type: ClusterIP
-
 webserver:
  ingress:
    enabled: true
    hostName: "app.<BASE_DOMAIN>"
  service:
    type: ClusterIP
-
 clearmlApplications:
  enabled: true
 ```

-#### Additional Configuration Options
-##### Fixed Users (Simple Login)
+### Install the Chart

+Install the ClearML Enterprise Helm chart using the values override file created in the previous step.

-Enable static login with username and password in `overrides.yaml`.
-
-
-This is an optional step in case SSO (Identity provider) configuration will not be performed.
-
-
-```
-apiserver:
- additionalConfigs:
-   apiserver.conf: |
-     auth {
-       fixed_users {
-         enabled: true
-         pass_hashed: false
-         users: [
-           {
-             username: "my_user"
-             password: "my_password"
-             name: "My User"
-             admin: true
-           },
-         ]
-       }
-     }
+``` bash
+helm upgrade -i -n clearml clearml clearml-enterprise/clearml-enterprise --create-namespace -f clearml-values.override.yaml 
 ```

+## Additional Configuration Options

-##### SSO (Identity Provider)
+:::note
+You can view the full set of available and documented values of the chart by running the following command:

-
-The following examples (Auth0 and Keycloak) show how to configure an identity provider on the ClearML server.
-
-
-Add the following values configuring `extraEnvs` for `apiserver` in the `clearml-enterprise` values `override.yaml` file.
-
-
-Substitute all `<PLACEHOLDER>`s with the correct value for your configuration.
-
-
-##### Auth0 Identity Provider
-
-
-```yaml
-apiserver:
- extraEnvs:
-   - name: CLEARML__secure__login__sso__oauth_client__auth0__client_id
-     value: "<SSO_CLIENT_ID>"
-   - name: CLEARML__secure__login__sso__oauth_client__auth0__client_secret
-     value: "<SSO_CLIENT_SECRET>"
-   - name: CLEARML__services__login__sso__oauth_client__auth0__base_url
-     value: "<SSO_CLIENT_URL>"
-   - name: CLEARML__services__login__sso__oauth_client__auth0__authorize_url
-     value: "<SSO_CLIENT_AUTHORIZE_URL>"
-   - name: CLEARML__services__login__sso__oauth_client__auth0__access_token_url
-     value: "<SSO_CLIENT_ACCESS_TOKEN_URL>"
-   - name: CLEARML__services__login__sso__oauth_client__auth0__audience
-     value: "<SSO_CLIENT_AUDIENCE>"
+```bash
+helm show readme clearml-enterprise/clearml-enterprise
+# or
+helm show values clearml-enterprise/clearml-enterprise
 ```
+:::

+### Default Secret Values

-##### Keycloak Identity Provider
+For improved security, all the internal credentials are auto-generated randomly and stored in a Secret in 
+Kubernetes.

+If you need to define your own credentials to be used instead, replace the default key and secret values in `clearml-values.override.yaml`.

-```yaml
-apiserver:
- extraEnvs:
-   - name: CLEARML__secure__login__sso__oauth_client__keycloak__client_id
-     value: "<KC_CLIENT_ID>"
-   - name: CLEARML__secure__login__sso__oauth_client__keycloak__client_secret
-     value: "<KC_SECRET_ID>"
-   - name: CLEARML__services__login__sso__oauth_client__keycloak__base_url
-     value: "<KC_URL>/realms/<REALM_NAME>/"
-   - name: CLEARML__services__login__sso__oauth_client__keycloak__authorize_url
-     value: "<KC_URL>/realms/<REALM_NAME>/protocol/openid-connect/auth"
-   - name: CLEARML__services__login__sso__oauth_client__keycloak__access_token_url
-     value: "<KC_URL>/realms/<REALM_NAME>/protocol/openid-connect/token"
-   - name: CLEARML__services__login__sso__oauth_client__keycloak__idp_logout
-     value: "true"
-```
-
-
-#### Installing the Chart
-
-
-```
-helm install -n clearml \
-    clearml \
-    clearml-enterprise/clearml-enterprise \
-    --create-namespace \
-    -f overrides.yaml
-```
-
-
-### Install ClearML Agent Chart
-
-
-#### Configuration
-
-
-To configure the agent you will need to choose a Redis password and use that when setting up Redis as well
-(see [Shared Redis installation](multi_tenant_k8s.md#shared-redis-installation)).
-
-
-The Helm Chart must be installed with `overrides.yaml`:
-
-
-```yaml
-imageCredentials:
-  password: "<CLEARML_DOCKERHUB_TOKEN>"
+``` yaml
 clearml:
-  agentk8sglueKey: "<ACCESS_KEY>"
-  agentk8sglueSecret: "<SECRET_KEY>"
-agentk8sglue:
-  apiServerUrlReference: "https://api.<BASE_DOMAIN>"
-  fileServerUrlReference: "https://files.<BASE_DOMAIN>"
-  webServerUrlReference: "https://app.<BASE_DOMAIN>"
-  defaultContainerImage: "python:3.9"
+  # Replace the following values to use custom internal credentials.
+  apiserverKey: ""
+  apiserverSecret: ""
+  fileserverKey: ""
+  fileserverSecret: ""
+  secureAuthTokenSecret: ""
+  testUserKey: ""
+  testUserSecret: ""
 ```

+If `openssl` is installed, you can generate random strings suitable for use as keys and secrets with the following shell command:

-#### Installing the Chart
-
-
-```bash
-helm install -n <WORKLOAD_NAMESPACE> \
-    clearml-agent \
-    clearml-enterprise/clearml-enterprise-agent \
-    --create-namespace \
-    -f overrides.yaml
+``` bash
+openssl rand -hex 16
 ```

+### Fixed Users

-To create a queue by API:
+Enable and configure simple login with username and password in `clearml-values.override.yaml`. This is useful for simple PoC 
+installations. This step is optional and applies only when SSO (identity provider) configuration is not performed.

+Note that fixed users are not well suited to multi-tenant setups, because they are associated only with the default tenant.

-```bash
-curl $APISERVER_URL/queues.create \
-H "Content-Type: application/json" \
-H "X-Clearml-Impersonate-As:<USER_ID>" \
-u $APISERVER_KEY:$APISERVER_SECRET \
-d '{"name":"default"}'
+``` yaml
+apiserver:
+  additionalConfigs:
+    apiserver.conf: |
+      auth {
+        fixed_users {
+          enabled: true
+          pass_hashed: false
+          users: [
+            {
+              username: "my_user"
+              password: "my_password"
+              name: "My User"
+              admin: true
+            },
+          ]
+        }
+      }
 ```

+## Next Steps

-## ClearML AI Application Gateway Installation
-
-
-### Configuring Chart
-
-
-The Helm Chart must be installed with `overrides.yaml`:
-
-
-```yaml
-imageCredentials:
-  password: "<DOCKERHUB_TOKEN>"
-clearml:
-  apiServerKey: ""
-  apiServerSecret: ""
-  apiServerUrlReference: "https://api."
-  authCookieName: ""
-ingress:
-  enabled: true
-  hostName: "task-router.dev"
-tcpSession:
-  routerAddress: "<NODE_IP OR EXTERNAL_NAME>"
-  portRange:
-    start: <START_PORT>
-    end: <END_PORT>
-```
-
-
-**Configuration options:**
-
-
-* **`clearml.apiServerUrlReference`:** URL usually starting with `https://api.` 
-* **`clearml.apiServerKey`:** ClearML server API key 
-* **`clearml.apiServerSecret`:** ClearML server secret key 
-* **`ingress.hostName`:** URL of the router we configured previously for load balancer starting with `https://` 
-* **`clearml.sslVerify`:** Enable or disable SSL certificate validation on apiserver calls check
-* **`clearml.authCookieName`:** Value from `value_prefix` key starting with `allegro_token` in `envoy.yaml` file in ClearML server installation. 
-* **`tcpSession.routerAddress`**: Router external address can be an IP or the host machine or a load balancer hostname, depends on the network configuration 
-* **`tcpSession.portRange.start`**: Start port for the TCP Session feature 
-* **`tcpSession.portRange.end`**: End port for the TCP Session feature
-
-
-### Installing the Chart
-
-
-```bash
-helm install -n <WORKLOAD_NAMESPACE> \
-    clearml-ttr \
-    clearml-enterprise/clearml-enterprise-task-traffic-router \
-    --create-namespace \
-    -f overrides.yaml
-```
-
-
-
-
-## Applications Installation
-
-
-To install the ClearML Applications on the newly installed ClearML Enterprise control-plane, download the applications
-package using the URL provided by the ClearML staff.
-
-
-
-
-### Download and Extract
-
-
-```
-wget -O apps.zip "<ClearML enterprise applications configuration download url>"
-unzip apps.zip
-```
-
-
-### Adjust Application Docker Images Location  (Air-Gapped Systems)
-
-
-ClearML applications use pre-built docker images provided by ClearML on the ClearML DockerHub
-repository. If you are using an air-gapped system, these images must be available as part of your internal docker
-registry, and the correct docker images location must be specified before installing the applications. 
-
-
-Use the following script to adjust the applications packages accordingly before installing the applications:
-
-
-```
-python convert_image_registry.py \
- --apps-dir /path/to/apps/ \
- --repo local_registry/clearml-apps
-```
-
-
-The script will change the application zip files to point to the new registry, and will output the list of containers
-that need to be copied to the local registry. For example:
-
-
-```
-make sure allegroai/clearml-apps:hpo-1.10.0-1062 was added to local_registry/clearml-apps
-```
-
-
-### Install Applications
-
-
-Use the `upload_apps.py` script to upload the application packages to the ClearML server:
-
-
-```
-python upload_apps.py \
- --host $APISERVER_ADDRESS \
- --user $APISERVER_USER --password $APISERVER_PASSWORD \
- --dir apps -ml
-```
-
-
-## Configuring Shared Memory for Large Model Deployment
-
-
-Deploying large models may fail due to shared memory size limitations. This issue commonly arises when the allocated
-`/dev/shm` space is insufficient.:
-
-
-```
-> 3d3e22c3066f:168:168 [0] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-UbzKZ9 to 9637892 bytes
-> 3d3e22c3066f:168:168 [0] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-UbzKZ9 (size 9637888)
-> 3d3e22c3066f:168:168 [0] NCCL INFO transport/shm.cc:114 -> 2
-> 3d3e22c3066f:168:168 [0] NCCL INFO transport.cc:33 -> 2
-> 3d3e22c3066f:168:168 [0] NCCL INFO transport.cc:113 -> 2
-> 3d3e22c3066f:168:168 [0] NCCL INFO init.cc:1263 -> 2
-> 3d3e22c3066f:168:168 [0] NCCL INFO init.cc:1548 -> 2
-> 3d3e22c3066f:168:168 [0] NCCL INFO init.cc:1799 -> 2
-```
-
-
-To configure a proper SHM size you can use the following configuration in the agent `overrides.yaml`.
-
-
-Replace `<SIZE>` with the desired memory allocation in GiB, based on your model requirements.
-
-
-This example configures a specific queue, but you can include this setting in the `basePodTemplate` if you need to
-apply it to all tasks.
-
-
-```yaml
-agentk8sglue: 
-  queues:
-    GPUshm:
-      templateOverrides:
-        env:
-          - name: VLLM_SKIP_P2P_CHECK
-            value: "1"
-        volumeMounts:
-          - name: dshm
-            mountPath: /dev/shm
-        volumes:
-          - name: dshm
-            emptyDir:
-              medium: Memory
-              sizeLimit: <SIZE>Gi
-```
+Once the ClearML Enterprise Server is up and running, proceed with installing the ClearML Enterprise Agent and 
+[AI App Gateway](appgw_install_k8s.md).
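
The custom-credentials option described under "Default Secret Values" can be sketched as a small script. This is an illustration only, not part of the commit: the function names `gen_secret` and `render_secrets` are hypothetical, and the script simply applies the guide's `openssl rand -hex 16` command once per field listed in the `clearml:` block.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Generate one random 32-character hex string (16 random bytes),
# using the same openssl command the guide suggests.
gen_secret() {
  openssl rand -hex 16
}

# Print a values fragment with a fresh random value for each
# internal credential field named in the guide.
# (render_secrets is a hypothetical helper, not a ClearML tool.)
render_secrets() {
  echo "clearml:"
  local key
  for key in apiserverKey apiserverSecret fileserverKey fileserverSecret \
             secureAuthTokenSecret testUserKey testUserSecret; do
    printf '  %s: "%s"\n' "$key" "$(gen_secret)"
  done
}

render_secrets
```

One way to use it, assuming a separate file is preferred over editing `clearml-values.override.yaml` by hand: redirect the output to a file (for example a hypothetical `clearml-secrets.override.yaml`) and pass it to Helm as an additional `-f` argument after the main override file. Keep the generated file and reuse it for later upgrades, since running the script again produces different values.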