mirror of https://github.com/clearml/clearml-docs, synced 2025-05-16 10:26:19 +00:00
commit
d0a8cc4448
@ -6,13 +6,13 @@ title: Kubernetes Deployment
|
||||
The AI Application Gateway is available under the ClearML Enterprise plan.
|
||||
:::
|
||||
|
||||
This guide details the installation of the ClearML App Gateway Router.
|
||||
The App Gateway Router enables access to your AI workload applications (e.g. remote IDEs like VSCode and Jupyter, model API interface, etc.).
|
||||
This guide details the installation of the ClearML App Gateway.
|
||||
The App Gateway enables access to your AI workload applications (e.g. remote IDEs like VSCode and Jupyter, or model API interfaces).
|
||||
It acts as a proxy, identifying ClearML Tasks running within its [K8s namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/)
|
||||
and making them available for network access.
|
||||
|
||||
:::important
|
||||
The App Gateway Router must be installed in the same K8s namespace as a dedicated ClearML Agent.
|
||||
The App Gateway must be installed in the same K8s namespace as a dedicated ClearML Agent.
|
||||
It can only configure access for ClearML Tasks within its own namespace.
|
||||
:::
|
||||
|
||||
@ -27,35 +27,31 @@ It can only configure access for ClearML Tasks within its own namespace.
|
||||
|
||||
## Optional for HTTPS
|
||||
|
||||
* A valid DNS entry for the new App Gateway Router instance
|
||||
* A valid DNS entry for the new App Gateway instance
|
||||
* A valid SSL certificate
|
||||
|
||||
## Helm
|
||||
|
||||
### Login
|
||||
|
||||
```
|
||||
helm repo add clearml-enterprise \
|
||||
https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages \
|
||||
--username <GITHUB_TOKEN> \
|
||||
--password <GITHUB_TOKEN>
|
||||
``` bash
|
||||
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <GITHUB_TOKEN> --password <GITHUB_TOKEN>
|
||||
```
|
||||
|
||||
Replace `<GITHUB_TOKEN>` with your valid GitHub token that has access to the ClearML Enterprise Helm charts repository.
|
||||
|
||||
### Prepare Values
|
||||
|
||||
Before installing the App Gateway Router, create a Helm override file:
|
||||
Before installing the App Gateway, create a Helm override `clearml-app-gateway-values.override.yaml` file:
|
||||
|
||||
```
|
||||
```yaml
|
||||
imageCredentials:
|
||||
password: ""
|
||||
clearml:
|
||||
apiServerKey: ""
|
||||
apiServerSecret: ""
|
||||
apiKey: ""
|
||||
apiSecret: ""
|
||||
apiServerUrlReference: ""
|
||||
authCookieName: ""
|
||||
sslVerify: true
|
||||
ingress:
|
||||
enabled: true
|
||||
hostName: ""
|
||||
@ -71,13 +67,12 @@ tcpSession:
|
||||
**Configuration options:**
|
||||
|
||||
* `imageCredentials.password`: ClearML DockerHub Access Token.
|
||||
* `clearml.apiServerKey`: ClearML server API key.
|
||||
* `clearml.apiServerSecret`: ClearML server secret key.
|
||||
* `clearml.apiKey` and `clearml.apiSecret`: [API credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials) created in the ClearML web UI by an Admin user or Service
|
||||
Account with admin privileges. Make sure to label these credentials clearly, so that they will not be revoked by mistake.
|
||||
* `clearml.apiServerUrlReference`: ClearML API server URL starting with `https://api.`.
|
||||
* `clearml.authCookieName`: Name of the cookie used by the ClearML server for authentication.
|
||||
* `clearml.sslVerify`: Enable or disable SSL certificate validation on calls to the `apiserver`.
|
||||
* `ingress.hostName`: Hostname of router used by the ingress controller to access it.
|
||||
* `tcpSession.routerAddress`: The external router address (can be an IP, hostname, or load balancer address) depending on your network setup. Ensure this address is accessible for TCP connections.
|
||||
* `ingress.hostName`: Hostname that the ingress controller uses to reach the App Gateway.
|
||||
* `tcpSession.routerAddress`: The external App Gateway address. Depending on your network setup, this can be an IP, a hostname, or a load balancer address. Ensure this address is accessible for TCP connections.
|
||||
* `tcpSession.service.type`: Service type used to expose TCP functionality. The default is `NodePort`.
|
||||
* `tcpSession.portRange.start`: Start port for the TCP Session feature.
|
||||
* `tcpSession.portRange.end`: End port for the TCP Session feature.
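
Before setting `tcpSession.portRange.start` and `tcpSession.portRange.end`, it can help to sanity-check the planned range. The following is a minimal sketch, not part of the chart: the function name `check_port_range` and the example ports are assumptions made for illustration. If you keep the default `NodePort` service type, note that Kubernetes allocates NodePorts from 30000-32767 by default, although a cluster may be configured with a different range.

```bash
#!/usr/bin/env bash
# Hypothetical example values; replace with the range planned for tcpSession.portRange.
START_PORT=30100
END_PORT=30200

# Illustrative helper (not provided by ClearML): validates a TCP port range.
check_port_range() {
  local start="$1" end="$2"
  # Both bounds must be valid TCP ports.
  if (( start < 1 || end > 65535 )); then
    echo "invalid: ports must be within 1-65535"
    return 1
  fi
  # The range must not be inverted.
  if (( start > end )); then
    echo "invalid: start port is greater than end port"
    return 1
  fi
  echo "ok: $(( end - start + 1 )) ports in range ${start}-${end}"
}

check_port_range "$START_PORT" "$END_PORT"
```

With the example values above, the script reports a range of 101 ports.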
|
||||
@ -85,33 +80,28 @@ tcpSession:
|
||||
|
||||
The full list of supported configuration options is available with the following command:
|
||||
|
||||
```
|
||||
helm show readme allegroai-enterprise/clearml-enterprise-task-traffic-router
|
||||
``` bash
|
||||
helm show readme clearml-enterprise/clearml-enterprise-app-gateway
|
||||
```
|
||||
|
||||
### Install
|
||||
|
||||
To install the App Gateway Router component via Helm use the following command:
|
||||
To install the App Gateway component via Helm, use the following command:
|
||||
|
||||
```
|
||||
helm upgrade --install \
|
||||
<RELEASE_NAME> \
|
||||
-n <WORKLOAD_NAMESPACE> \
|
||||
allegroai-enterprise/clearml-enterprise-task-traffic-router \
|
||||
--version <CHART_VERSION> \
|
||||
-f override.yaml
|
||||
``` bash
|
||||
helm upgrade --install <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
|
||||
```
|
||||
|
||||
Replace the placeholders with the following values:
|
||||
|
||||
* `<RELEASE_NAME>` - Unique name for the App Gateway Router within the K8s namespace. This is a required parameter in
|
||||
Helm, which identifies a specific installation of the chart. The release name also defines the router’s name and
|
||||
* `<RELEASE_NAME>` - Unique name for the App Gateway within the K8s namespace. This is a required parameter in
|
||||
Helm, which identifies a specific installation of the chart. The release name also defines the App Gateway's name and
|
||||
appears in the UI within AI workload application URLs (e.g. Remote IDE URLs). This can be customized to support multiple installations within the same
|
||||
namespace by assigning different release names.
|
||||
* `<WORKLOAD_NAMESPACE>` - [Kubernetes Namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/)
|
||||
where workloads will be executed. This namespace must be shared between a dedicated ClearML Agent and an App
|
||||
Gateway Router. The agent is responsible for monitoring its assigned task queues and spawning workloads within this
|
||||
namespace. The router monitors the same namespace for AI workloads (e.g. remote IDE applications). The router has a
|
||||
Gateway. The agent is responsible for monitoring its assigned task queues and spawning workloads within this
|
||||
namespace. The App Gateway monitors the same namespace for AI workloads (e.g. remote IDE applications). The App Gateway has a
|
||||
namespace-limited scope, meaning it can only detect and manage tasks within its
|
||||
assigned namespace.
|
||||
* `<CHART_VERSION>` - Version recommended by the ClearML Support Team.
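
As an illustration of how the placeholders fit together, the sketch below assembles the install command from shell variables and prints it for review. The release name and namespace are hypothetical example values, and `<CHART_VERSION>` is deliberately left as a placeholder because the version comes from the ClearML Support Team.

```bash
#!/usr/bin/env bash
# All values below are hypothetical examples, not recommended settings.
RELEASE_NAME="clearml-app-gateway"
WORKLOAD_NAMESPACE="clearml-workloads"
CHART_VERSION="<CHART_VERSION>"  # placeholder: use the version recommended by ClearML Support

# Build the command as an array so that each argument stays correctly quoted.
install_cmd=(
  helm upgrade --install "$RELEASE_NAME"
  -n "$WORKLOAD_NAMESPACE"
  clearml-enterprise/clearml-enterprise-app-gateway
  --version "$CHART_VERSION"
  -f clearml-app-gateway-values.override.yaml
)

# Print the command for review; run it with "${install_cmd[@]}" once the values are final.
echo "${install_cmd[*]}"
```

After a successful install, `helm status "$RELEASE_NAME" -n "$WORKLOAD_NAMESPACE"` reports the state of the release.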
|
@ -2,408 +2,168 @@
|
||||
title: Kubernetes
|
||||
---
|
||||
|
||||
This guide provides step-by-step instructions for installing the ClearML Enterprise Server (control-plane) in a Kubernetes cluster.
|
||||
|
||||
This guide provides step-by-step instructions for setting up ClearML Enterprise in a Kubernetes cluster.
|
||||
The ClearML Enterprise Server includes the ClearML `apiserver`, `fileserver`, and `webserver` components.
|
||||
The package also includes MongoDB, ElasticSearch, and Redis as Helm dependencies.
|
||||
|
||||
|
||||
## Prerequisites
|
||||
|
||||
To deploy a ClearML Server, ensure the following components and configurations are in place:
|
||||
|
||||
* A Kubernetes cluster
|
||||
* An ingress controller (e.g. `nginx-ingress`) and the ability to create LoadBalancer services (e.g. MetalLB) if needed
|
||||
to expose ClearML
|
||||
* Credentials for ClearML Enterprise GitHub Helm chart repository
|
||||
* Credentials for ClearML Enterprise DockerHub repository
|
||||
* URL for downloading the ClearML Enterprise applications configuration
|
||||
- Kubernetes Cluster: A standard Kubernetes cluster is preferred for optimal GPU support.
|
||||
- CLI Tools: `kubectl` and `helm` must be installed and configured.
|
||||
- Ingress Controller: An Ingress controller (e.g., `nginx-ingress`) is required. If exposing services externally, a
|
||||
LoadBalancer-capable solution (e.g. `MetalLB`) should also be configured.
|
||||
- Network: The server and workers must be able to communicate over HTTP/S (ports 80 and 443). Additionally, the TCP session feature requires a
  range of ports for TCP traffic based on your configuration (see [AI App Gateway installation](appgw_install_k8s.md)).
|
||||
- DNS Configuration: A domain with subdomain support is required, ideally with trusted TLS certificates. All entries must
|
||||
be resolvable by the Ingress controller. Example subdomains:
|
||||
- Server:
|
||||
- `api.<BASE_DOMAIN>`
|
||||
- `app.<BASE_DOMAIN>`
|
||||
- `files.<BASE_DOMAIN>`
|
||||
- Worker:
|
||||
- `router.<BASE_DOMAIN>`
|
||||
- `tcp-router.<BASE_DOMAIN>` (optional, for TCP sessions)
|
||||
- Storage: A configured StorageClass and an accessible storage backend.
|
||||
- ClearML Enterprise Access:
|
||||
- Helm repository credentials (`<HELM_REPO_TOKEN>`)
|
||||
- DockerHub registry credentials (`<CLEARML_DOCKERHUB_TOKEN>`)
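
The example subdomains in the DNS requirement above can be derived from a single base domain. The sketch below only prints the expected host names so that they can be compared against your DNS records. The function name `list_clearml_hosts` and the domain `example.com` are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Illustrative helper (not provided by ClearML): prints the example host names
# from the prerequisites for a given base domain.
list_clearml_hosts() {
  local base_domain="$1" sub
  # Three server subdomains, followed by the two worker-side subdomains
  # (tcp-router is only needed for TCP sessions).
  for sub in api app files router tcp-router; do
    echo "${sub}.${base_domain}"
  done
}

# Hypothetical base domain used only for this example.
list_clearml_hosts "example.com"
```

Each printed name can then be checked with a resolver of your choice (for example `nslookup`) to confirm that it resolves to the Ingress controller.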
|
||||
|
||||
### Recommended Cluster Specifications
|
||||
|
||||
## Control Plane Installation
|
||||
For optimal performance, a Kubernetes cluster with at least 3 nodes is recommended, each having:
|
||||
|
||||
- 8 vCPUs
|
||||
- 32 GB RAM
|
||||
- 500 GB storage
|
||||
|
||||
The following steps cover installing the control plane (server and required charts) and will
|
||||
require some or all of the tokens/deliverables mentioned above.
|
||||
## Installation
|
||||
|
||||
### Add the Helm Repo Locally
|
||||
|
||||
### Requirements
|
||||
Add the ClearML Helm repository:
|
||||
|
||||
``` bash
|
||||
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <HELM_REPO_TOKEN> --password <HELM_REPO_TOKEN>
|
||||
```
|
||||
|
||||
* Add the ClearML Enterprise repository:
|
||||
Update the repository locally:
|
||||
``` bash
|
||||
helm repo update
|
||||
```
|
||||
|
||||
### Prepare Values
|
||||
|
||||
```
|
||||
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <clearmlenterprise_GitHub_TOKEN> --password <clearmlenterprise_GitHub_TOKEN>
|
||||
```
|
||||
|
||||
|
||||
* Update the repository locally:
|
||||
|
||||
|
||||
```
|
||||
helm repo update
|
||||
```
|
||||
|
||||
|
||||
### Install ClearML Enterprise Chart
|
||||
|
||||
|
||||
#### Configuration
|
||||
|
||||
|
||||
The Helm Chart must be installed with an `overrides.yaml` overriding values as follows:
|
||||
|
||||
Create a `clearml-values.override.yaml` file with the following content:
|
||||
|
||||
:::note
|
||||
In the following configuration, replace `<BASE_DOMAIN>` with a valid domain
|
||||
that will have records pointing to the cluster’s ingress controller (see ingress details in the values below).
|
||||
In the following configuration, replace the `<BASE_DOMAIN>` placeholders with a valid domain that will have records
|
||||
pointing to the cluster's Ingress Controller. This will be the base domain for reaching your ClearML installation.
|
||||
:::
|
||||
|
||||
|
||||
```yaml
|
||||
``` yaml
|
||||
imageCredentials:
|
||||
password: "<clearml_enterprise_DockerHub_TOKEN>"
|
||||
|
||||
password: "<CLEARML_DOCKERHUB_TOKEN>"
|
||||
clearml:
|
||||
cookieDomain: "<BASE_DOMAIN>"
|
||||
# Set values for improved security
|
||||
apiserverKey: "<GENERATED_API_SERVER_KEY>"
|
||||
apiserverSecret: "<GENERATED_API_SERVER_SECRET>"
|
||||
fileserverKey: "<GENERATED_FILE_SERVER_KEY>"
|
||||
fileserverSecret: "<GENERATED_FILE_SERVER_SECRET>"
|
||||
secureAuthTokenSecret: "<GENERATED_AUTH_TOKEN_SECRET>"
|
||||
testUserKey: "<GENERATED_TEST_USER_KEY>"
|
||||
testUserSecret: "<GENERATED_TEST_USER_SECRET>"
|
||||
|
||||
apiserver:
|
||||
ingress:
|
||||
enabled: true
|
||||
hostName: "api.<BASE_DOMAIN>"
|
||||
service:
|
||||
type: ClusterIP
|
||||
|
||||
fileserver:
|
||||
ingress:
|
||||
enabled: true
|
||||
hostName: "file.<BASE_DOMAIN>"
|
||||
hostName: "files.<BASE_DOMAIN>"
|
||||
service:
|
||||
type: ClusterIP
|
||||
|
||||
webserver:
|
||||
ingress:
|
||||
enabled: true
|
||||
hostName: "app.<BASE_DOMAIN>"
|
||||
service:
|
||||
type: ClusterIP
|
||||
|
||||
clearmlApplications:
|
||||
enabled: true
|
||||
```
|
||||
|
||||
#### Additional Configuration Options
|
||||
##### Fixed Users (Simple Login)
|
||||
### Install the Chart
|
||||
|
||||
Install the ClearML Enterprise Helm chart using the `clearml-values.override.yaml` file created in the previous step.
|
||||
|
||||
Enable static login with username and password in `overrides.yaml`.
|
||||
|
||||
|
||||
This is an optional step in case SSO (Identity provider) configuration will not be performed.
|
||||
|
||||
|
||||
```
|
||||
apiserver:
|
||||
additionalConfigs:
|
||||
apiserver.conf: |
|
||||
auth {
|
||||
fixed_users {
|
||||
enabled: true
|
||||
pass_hashed: false
|
||||
users: [
|
||||
{
|
||||
username: "my_user"
|
||||
password: "my_password"
|
||||
name: "My User"
|
||||
admin: true
|
||||
},
|
||||
]
|
||||
}
|
||||
}
|
||||
``` bash
|
||||
helm upgrade -i -n clearml clearml clearml-enterprise/clearml-enterprise --create-namespace -f clearml-values.override.yaml
|
||||
```
|
||||
|
||||
## Additional Configuration Options
|
||||
|
||||
##### SSO (Identity Provider)
|
||||
:::note
|
||||
You can view the full set of available and documented values of the chart by running the following command:
|
||||
|
||||
|
||||
The following examples (Auth0 and Keycloak) show how to configure an identity provider on the ClearML server.
|
||||
|
||||
|
||||
Add the following values configuring `extraEnvs` for `apiserver` in the `clearml-enterprise` values `override.yaml` file.
|
||||
|
||||
|
||||
Substitute all `<PLACEHOLDER>`s with the correct value for your configuration.
|
||||
|
||||
|
||||
##### Auth0 Identity Provider
|
||||
|
||||
|
||||
```yaml
|
||||
apiserver:
|
||||
extraEnvs:
|
||||
- name: CLEARML__secure__login__sso__oauth_client__auth0__client_id
|
||||
value: "<SSO_CLIENT_ID>"
|
||||
- name: CLEARML__secure__login__sso__oauth_client__auth0__client_secret
|
||||
value: "<SSO_CLIENT_SECRET>"
|
||||
- name: CLEARML__services__login__sso__oauth_client__auth0__base_url
|
||||
value: "<SSO_CLIENT_URL>"
|
||||
- name: CLEARML__services__login__sso__oauth_client__auth0__authorize_url
|
||||
value: "<SSO_CLIENT_AUTHORIZE_URL>"
|
||||
- name: CLEARML__services__login__sso__oauth_client__auth0__access_token_url
|
||||
value: "<SSO_CLIENT_ACCESS_TOKEN_URL>"
|
||||
- name: CLEARML__services__login__sso__oauth_client__auth0__audience
|
||||
value: "<SSO_CLIENT_AUDIENCE>"
|
||||
```bash
|
||||
helm show readme clearml-enterprise/clearml-enterprise
|
||||
# or
|
||||
helm show values clearml-enterprise/clearml-enterprise
|
||||
```
|
||||
:::
|
||||
|
||||
### Default Secret Values
|
||||
|
||||
##### Keycloak Identity Provider
|
||||
For improved security, all internal credentials are randomly generated and stored in a Kubernetes Secret.
|
||||
|
||||
If you need to define your own credentials to be used instead, replace the default key and secret values in `clearml-values.override.yaml`.
|
||||
|
||||
```yaml
|
||||
apiserver:
|
||||
extraEnvs:
|
||||
- name: CLEARML__secure__login__sso__oauth_client__keycloak__client_id
|
||||
value: "<KC_CLIENT_ID>"
|
||||
- name: CLEARML__secure__login__sso__oauth_client__keycloak__client_secret
|
||||
value: "<KC_SECRET_ID>"
|
||||
- name: CLEARML__services__login__sso__oauth_client__keycloak__base_url
|
||||
value: "<KC_URL>/realms/<REALM_NAME>/"
|
||||
- name: CLEARML__services__login__sso__oauth_client__keycloak__authorize_url
|
||||
value: "<KC_URL>/realms/<REALM_NAME>/protocol/openid-connect/auth"
|
||||
- name: CLEARML__services__login__sso__oauth_client__keycloak__access_token_url
|
||||
value: "<KC_URL>/realms/<REALM_NAME>/protocol/openid-connect/token"
|
||||
- name: CLEARML__services__login__sso__oauth_client__keycloak__idp_logout
|
||||
value: "true"
|
||||
```
|
||||
|
||||
|
||||
#### Installing the Chart
|
||||
|
||||
|
||||
```
|
||||
helm install -n clearml \
|
||||
clearml \
|
||||
clearml-enterprise/clearml-enterprise \
|
||||
--create-namespace \
|
||||
-f overrides.yaml
|
||||
```
|
||||
|
||||
|
||||
### Install ClearML Agent Chart
|
||||
|
||||
|
||||
#### Configuration
|
||||
|
||||
|
||||
To configure the agent you will need to choose a Redis password and use that when setting up Redis as well
|
||||
(see [Shared Redis installation](multi_tenant_k8s.md#shared-redis-installation)).
|
||||
|
||||
|
||||
The Helm Chart must be installed with `overrides.yaml`:
|
||||
|
||||
|
||||
```yaml
|
||||
imageCredentials:
|
||||
password: "<CLEARML_DOCKERHUB_TOKEN>"
|
||||
``` yaml
|
||||
clearml:
|
||||
agentk8sglueKey: "<ACCESS_KEY>"
|
||||
agentk8sglueSecret: "<SECRET_KEY>"
|
||||
agentk8sglue:
|
||||
apiServerUrlReference: "https://api.<BASE_DOMAIN>"
|
||||
fileServerUrlReference: "https://files.<BASE_DOMAIN>"
|
||||
webServerUrlReference: "https://app.<BASE_DOMAIN>"
|
||||
defaultContainerImage: "python:3.9"
|
||||
# Replace the following values to use custom internal credentials.
|
||||
apiserverKey: ""
|
||||
apiserverSecret: ""
|
||||
fileserverKey: ""
|
||||
fileserverSecret: ""
|
||||
secureAuthTokenSecret: ""
|
||||
testUserKey: ""
|
||||
testUserSecret: ""
|
||||
```
|
||||
|
||||
In a shell, if `openssl` is installed, you can use this simple command to generate random strings suitable as keys and secrets:
|
||||
|
||||
#### Installing the Chart
|
||||
|
||||
|
||||
```bash
|
||||
helm install -n <WORKLOAD_NAMESPACE> \
|
||||
clearml-agent \
|
||||
clearml-enterprise/clearml-enterprise-agent \
|
||||
--create-namespace \
|
||||
-f overrides.yaml
|
||||
``` bash
|
||||
openssl rand -hex 16
|
||||
```
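
To fill in all seven values at once, the same `openssl` command can be run in a loop. This is a minimal sketch, assuming `openssl` is on the `PATH`; the function name `gen_clearml_secrets` is chosen here for illustration, and the key names are the ones listed in the values file above.

```bash
#!/usr/bin/env bash
# Illustrative helper (not provided by ClearML): prints one YAML line per
# credential, each with its own random 32-character hex string.
gen_clearml_secrets() {
  local name
  for name in apiserverKey apiserverSecret fileserverKey fileserverSecret \
              secureAuthTokenSecret testUserKey testUserSecret; do
    printf '  %s: "%s"\n' "$name" "$(openssl rand -hex 16)"
  done
}

# Emit a fragment that can be pasted into clearml-values.override.yaml.
echo "clearml:"
gen_clearml_secrets
```

Treat the generated output like any other secret: avoid committing it to version control.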
|
||||
|
||||
### Fixed Users
|
||||
|
||||
To create a queue by API:
|
||||
Enable and configure simple login with username and password in `clearml-values.override.yaml`. This is useful for simple PoC
installations. This step is optional and applies only when SSO (identity provider) is not configured.
|
||||
|
||||
Note that fixed users are not well suited to multi-tenant setups, as they are associated only with the default tenant.
|
||||
|
||||
```bash
|
||||
curl $APISERVER_URL/queues.create \
|
||||
-H "Content-Type: application/json" \
|
||||
-H "X-Clearml-Impersonate-As:<USER_ID>" \
|
||||
-u $APISERVER_KEY:$APISERVER_SECRET \
|
||||
-d '{"name":"default"}'
|
||||
``` yaml
|
||||
apiserver:
|
||||
additionalConfigs:
|
||||
apiserver.conf: |
|
||||
auth {
|
||||
fixed_users {
|
||||
enabled: true
|
||||
pass_hashed: false
|
||||
users: [
|
||||
{
|
||||
username: "my_user"
|
||||
password: "my_password"
|
||||
name: "My User"
|
||||
admin: true
|
||||
},
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Next Steps
|
||||
|
||||
## ClearML AI Application Gateway Installation
|
||||
|
||||
|
||||
### Configuring Chart
|
||||
|
||||
|
||||
The Helm Chart must be installed with `overrides.yaml`:
|
||||
|
||||
|
||||
```yaml
|
||||
imageCredentials:
|
||||
password: "<DOCKERHUB_TOKEN>"
|
||||
clearml:
|
||||
apiServerKey: ""
|
||||
apiServerSecret: ""
|
||||
apiServerUrlReference: "https://api."
|
||||
authCookieName: ""
|
||||
ingress:
|
||||
enabled: true
|
||||
hostName: "task-router.dev"
|
||||
tcpSession:
|
||||
routerAddress: "<NODE_IP OR EXTERNAL_NAME>"
|
||||
portRange:
|
||||
start: <START_PORT>
|
||||
end: <END_PORT>
|
||||
```
|
||||
|
||||
|
||||
**Configuration options:**
|
||||
|
||||
|
||||
* **`clearml.apiServerUrlReference`:** URL usually starting with `https://api.`
|
||||
* **`clearml.apiServerKey`:** ClearML server API key
|
||||
* **`clearml.apiServerSecret`:** ClearML server secret key
|
||||
* **`ingress.hostName`:** URL of the router we configured previously for load balancer starting with `https://`
|
||||
* **`clearml.sslVerify`:** Enable or disable SSL certificate validation on apiserver calls check
|
||||
* **`clearml.authCookieName`:** Value from `value_prefix` key starting with `allegro_token` in `envoy.yaml` file in ClearML server installation.
|
||||
* **`tcpSession.routerAddress`**: External address of the router. Depending on the network configuration, this can be the IP of the host machine or a load balancer hostname
|
||||
* **`tcpSession.portRange.start`**: Start port for the TCP Session feature
|
||||
* **`tcpSession.portRange.end`**: End port for the TCP Session feature
|
||||
|
||||
|
||||
### Installing the Chart
|
||||
|
||||
|
||||
```bash
|
||||
helm install -n <WORKLOAD_NAMESPACE> \
|
||||
clearml-ttr \
|
||||
clearml-enterprise/clearml-enterprise-task-traffic-router \
|
||||
--create-namespace \
|
||||
-f overrides.yaml
|
||||
```
|
||||
|
||||
|
||||
|
||||
|
||||
## Applications Installation
|
||||
|
||||
|
||||
To install the ClearML Applications on the newly installed ClearML Enterprise control-plane, download the applications
|
||||
package using the URL provided by the ClearML staff.
|
||||
|
||||
|
||||
|
||||
|
||||
### Download and Extract
|
||||
|
||||
|
||||
```
|
||||
wget -O apps.zip "<ClearML enterprise applications configuration download url>"
|
||||
unzip apps.zip
|
||||
```
|
||||
|
||||
|
||||
### Adjust Application Docker Images Location (Air-Gapped Systems)
|
||||
|
||||
|
||||
ClearML applications use pre-built docker images provided by ClearML on the ClearML DockerHub
|
||||
repository. If you are using an air-gapped system, these images must be available as part of your internal docker
|
||||
registry, and the correct docker images location must be specified before installing the applications.
|
||||
|
||||
|
||||
Use the following script to adjust the applications packages accordingly before installing the applications:
|
||||
|
||||
|
||||
```
|
||||
python convert_image_registry.py \
|
||||
--apps-dir /path/to/apps/ \
|
||||
--repo local_registry/clearml-apps
|
||||
```
|
||||
|
||||
|
||||
The script will change the application zip files to point to the new registry, and will output the list of containers
|
||||
that need to be copied to the local registry. For example:
|
||||
|
||||
|
||||
```
|
||||
make sure allegroai/clearml-apps:hpo-1.10.0-1062 was added to local_registry/clearml-apps
|
||||
```
|
||||
|
||||
|
||||
### Install Applications
|
||||
|
||||
|
||||
Use the `upload_apps.py` script to upload the application packages to the ClearML server:
|
||||
|
||||
|
||||
```
|
||||
python upload_apps.py \
|
||||
--host $APISERVER_ADDRESS \
|
||||
--user $APISERVER_USER --password $APISERVER_PASSWORD \
|
||||
--dir apps -ml
|
||||
```
|
||||
|
||||
|
||||
## Configuring Shared Memory for Large Model Deployment
|
||||
|
||||
|
||||
Deploying large models may fail due to shared memory size limitations. This issue commonly arises when the allocated
|
||||
`/dev/shm` space is insufficient:
|
||||
|
||||
|
||||
```
|
||||
> 3d3e22c3066f:168:168 [0] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-UbzKZ9 to 9637892 bytes
|
||||
> 3d3e22c3066f:168:168 [0] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-UbzKZ9 (size 9637888)
|
||||
> 3d3e22c3066f:168:168 [0] NCCL INFO transport/shm.cc:114 -> 2
|
||||
> 3d3e22c3066f:168:168 [0] NCCL INFO transport.cc:33 -> 2
|
||||
> 3d3e22c3066f:168:168 [0] NCCL INFO transport.cc:113 -> 2
|
||||
> 3d3e22c3066f:168:168 [0] NCCL INFO init.cc:1263 -> 2
|
||||
> 3d3e22c3066f:168:168 [0] NCCL INFO init.cc:1548 -> 2
|
||||
> 3d3e22c3066f:168:168 [0] NCCL INFO init.cc:1799 -> 2
|
||||
```
|
||||
|
||||
|
||||
To configure a proper SHM size you can use the following configuration in the agent `overrides.yaml`.
|
||||
|
||||
|
||||
Replace `<SIZE>` with the desired memory allocation in GiB, based on your model requirements.
|
||||
|
||||
|
||||
This example configures a specific queue, but you can include this setting in the `basePodTemplate` if you need to
|
||||
apply it to all tasks.
|
||||
|
||||
|
||||
```yaml
|
||||
agentk8sglue:
|
||||
queues:
|
||||
GPUshm:
|
||||
templateOverrides:
|
||||
env:
|
||||
- name: VLLM_SKIP_P2P_CHECK
|
||||
value: "1"
|
||||
volumeMounts:
|
||||
- name: dshm
|
||||
mountPath: /dev/shm
|
||||
volumes:
|
||||
- name: dshm
|
||||
emptyDir:
|
||||
medium: Memory
|
||||
sizeLimit: <SIZE>Gi
|
||||
```
|
||||
Once the ClearML Enterprise Server is up and running, proceed with installing the ClearML Enterprise Agent and
|
||||
[AI App Gateway](appgw_install_k8s.md).
|
||||