From aaa3851de3bf061c48ece7551945742d6966ef21 Mon Sep 17 00:00:00 2001 From: revital Date: Thu, 15 May 2025 14:55:35 +0300 Subject: [PATCH] Edits --- .../enterprise_deploy/agent_k8s.md | 42 ++++++------- .../dynamic_edit_task_pod_template.md | 29 ++++----- .../extra_configs/gpu_operator.md | 10 +-- .../extra_configs/multi_node_training.md | 3 +- .../extra_configs/presign_service.md | 10 +-- .../extra_configs/self_signed_certificates.md | 33 +++++----- .../extra_configs/sso_login.md | 21 ++++--- .../enterprise_deploy/fractional_gpus/cdmo.md | 28 ++++----- .../fractional_gpus/cdmo_cfgi_same_cluster.md | 48 +++++++------- .../enterprise_deploy/fractional_gpus/cfgi.md | 62 ++++++++++--------- 10 files changed, 139 insertions(+), 147 deletions(-) diff --git a/docs/deploying_clearml/enterprise_deploy/agent_k8s.md b/docs/deploying_clearml/enterprise_deploy/agent_k8s.md index 15ec5e58..e2c2c4dc 100644 --- a/docs/deploying_clearml/enterprise_deploy/agent_k8s.md +++ b/docs/deploying_clearml/enterprise_deploy/agent_k8s.md @@ -6,8 +6,8 @@ The ClearML Agent enables scheduling and executing distributed experiments on a ## Prerequisites -- A ClearML Enterprise server is up and running. -- Generate a set of `` and `` credentials in the ClearML Server. The easiest way is via +- A running [ClearML Enterprise Server](k8s.md) +- API credentials (`` and ``) generated via the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials). :::note @@ -15,7 +15,7 @@ The ClearML Agent enables scheduling and executing distributed experiments on a ::: - The worker environment must be able to access the ClearML Server over the same network. 
-- * Helm token to access `clearml-enterprise` helm-chart repo
+- Helm token to access `clearml-enterprise` Helm chart repo

 ## Installation

@@ -36,9 +36,9 @@ helm repo update
 Create a `clearml-agent-values.override.yaml` file with the following content:

 :::note
-Replace the `` and `` with the admin credentials
-you created earlier. Set `ServerUrlReference` to the relevant URLs of your ClearML
-Server installation.
+Replace the `` and `` with the API credentials you generated earlier.
+Set the `ServerUrlReference` fields to match your ClearML
+Server URLs.
 :::

 ```yaml
@@ -60,7 +60,7 @@ agentk8sglue:

 ### Install the Chart

-Install the ClearML Enterprise Agent Helm chart using the previous values override file:
+Install the ClearML Enterprise Agent Helm chart:

 ```bash
 helm upgrade -i -n clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
@@ -68,7 +68,7 @@ helm upgrade -i -n clearml-agent clearml-e

 ## Additional Configuration Options

-To view all configurable options for the Helm chart, run the following command:
+To view available configuration options for the Helm chart, run the following command:

 ```bash
 helm show readme clearml-enterprise/clearml-enterprise-agent
@@ -76,7 +76,7 @@ helm show readme clearml-enterprise/clearml-enterprise-agent
 helm show values clearml-enterprise/clearml-enterprise-agent
 ```

-### Set GPU Availability in Orchestration Dashboard
+### Reporting GPU Availability to Orchestration Dashboard

 To show GPU availability in the [Orchestration Dashboard](../../webapp/webapp_orchestration_dash.md), explicitly set
 the number of GPUs:

@@ -88,25 +88,22 @@ agentk8sglue:

 ### Queues

-The ClearML Agent monitors ClearML queues and pulls tasks that are scheduled for execution.
+The ClearML Agent monitors [ClearML queues](../../fundamentals/agents_and_queues.md) and pulls tasks that are
+scheduled for execution.

-A single agent can monitor multiple queues.
By default, the queues share a base pod template (`agentk8sglue.basePodTemplate`)
-used when submitting a task to Kubernetes after it has been extracted from the queue.
+A single agent can monitor multiple queues. By default, all queues share a base pod template (`agentk8sglue.basePodTemplate`)
+used when launching a task on Kubernetes after it has been pulled from the queue.

-Each queue can be configured with dedicated Pod template spec override (`templateOverrides`). This way queue definitions
-can be tailored to different use cases.
+Each queue can be configured to override the base pod template with its own settings via `templateOverrides`.
+This way queue definitions can be tailored to different use cases.

 The following are a few examples of agent queue templates:

 #### Example: GPU Queues

+To support GPU queues, you must deploy the NVIDIA GPU Operator on your Kubernetes cluster. For more information, see [GPU Operator](extra_configs/gpu_operator.md).

-GPU queue support requires deploying the NVIDIA GPU Operator on your Kubernetes cluster.
-
-For more information, see [GPU Operator](extra_configs/gpu_operator.md).
-
-
-``` yaml
+```yaml
 agentk8sglue:
   createQueues: true
   queues:
@@ -122,8 +119,9 @@ agentk8sglue:
             nvidia.com/gpu: 2
 ```

-#### Example: Overriding Pod Templates per Queue
+#### Example: Custom Pod Template per Queue

+This example demonstrates how to override the base pod template definitions on a per-queue basis.
 In this example:

 - The `red` queue inherits both the label `team=red` and the 1Gi memory limit from the `basePodTemplate` section.
@@ -167,5 +165,5 @@ agentk8sglue:

 ## Next Steps

-Once the agent is up and running, proceed with deploying the[ ClearML Enterprise App Gateway](appgw_install_k8s.md).
+Once the agent is up and running, proceed with deploying the [ClearML Enterprise App Gateway](appgw_install_k8s.md).
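The layering of a queue's `templateOverrides` on top of `agentk8sglue.basePodTemplate` described above can be pictured as a recursive dictionary merge. The sketch below is illustrative only (the agent's actual merge logic may differ, and `merge_pod_template` is a hypothetical helper); it mirrors the `red`/`blue` queue example, where `blue` overrides the team label but still inherits the base memory limit:

```python
def merge_pod_template(base: dict, override: dict) -> dict:
    """Recursively merge a queue's templateOverrides onto the base pod template.

    Nested mappings merge key by key; scalar and list values from the
    override replace the base value outright.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_pod_template(merged[key], value)
        else:
            merged[key] = value
    return merged


# "blue" overrides the team label but inherits the base memory limit.
base = {"labels": {"team": "red"}, "resources": {"limits": {"memory": "1Gi"}}}
blue = merge_pod_template(base, {"labels": {"team": "blue"}})
print(blue["labels"]["team"])                 # blue
print(blue["resources"]["limits"]["memory"])  # 1Gi
```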
diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md index bf33258e..911ff411 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/dynamic_edit_task_pod_template.md @@ -2,18 +2,15 @@ title: Dynamically Edit Task Pod Template --- -The ClearML Enterprise Agent supports defining custom Python code to modify a task's Pod template before it is applied -to Kubernetes. +ClearML Agent allows you to inject custom Python code to dynamically modify the Kubernetes Pod template before applying it. -This enables dynamic customization of Task Pod manifests in the context of a ClearML Enterprise Agent, which is useful -for injecting values or changing configurations based on runtime context. ## Agent Configuration The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` environment variable defines the Python module and function inside that -module that the ClearML Enterprise Agent should invoke before applying a Task Pod template. +module to be invoked by the agent before applying a task pod template. -The Agent will run this code in its own context, pass arguments (including the actual template) to the function, and use +The agent will run this code in its own context, pass arguments (including the actual template) to the function, and use the returned template to create the final Task Pod in Kubernetes. Arguments passed to the function include: @@ -60,13 +57,13 @@ agentk8sglue: ``` :::note notes -* Make sure to include `*args, **kwargs` at the end of the function's argument list and to only use keyword arguments. +* Always include `*args, **kwargs` at the end of the function's argument list and only use keyword arguments. This is needed to maintain backward compatibility. 
 * Custom code modules can be included as a file in the pod's container, and the environment variable can be used to
   point to the file and entry point.

-* When defining a custom code module, by default the Agent will start watching pods in all namespaces
+* When defining a custom code module, by default the agent will start watching pods in all namespaces
   across the cluster. If you do not intend to give a `ClusterRole` permission, make sure to set the
   `CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES` env to `"0"` to prevent the Agent to try listing pods in all namespaces.
   Instead, set it to `"1"` if namespace-related changes are needed in the code.
@@ -80,13 +77,13 @@ agentk8sglue:

 To customize the bash startup scripts instead of the pod spec, use:

-```yaml
-agentk8sglue:
-  # -- Custom Bash script for the Agent pod ran by Glue Agent
-  customBashScript: ""
-  # -- Custom Bash script for the Task Pods ran by Glue Agent
-  containerCustomBashScript: ""
-```
+  ```yaml
+  agentk8sglue:
+    # -- Custom Bash script for the Agent pod run by the Glue Agent
+    customBashScript: ""
+    # -- Custom Bash script for the Task Pods run by the Glue Agent
+    containerCustomBashScript: ""
+  ```

 ## Examples

@@ -167,7 +164,7 @@ agentk8sglue:

 ### Example: Bind PVC Resource to Task Pod

-In this example, a PVC is created and attached to every Pod created from a dedicated queue, then deleted afterwards.
+In this example, a PVC is created and attached to every pod created from a dedicated queue, and deleted afterwards.
Key points: diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md index d4cc3386..97b59e75 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/gpu_operator.md @@ -2,10 +2,12 @@ title: Basic Deployment - Suggested GPU Operator Values --- +This guide provides recommended configuration values for deploying the NVIDIA GPU Operator alongside ClearML Enterprise. ## Add the Helm Repo Locally Add the NVIDIA GPU Operator Helm repository: + ```bash helm repo add nvidia https://nvidia.github.io/gpu-operator ``` @@ -17,10 +19,8 @@ helm repo update ## Installation -To prevent unprivileged containers from bypassing the Kubernetes Device Plugin API, configure the GPU operator with the -following override values. - -Create a `gpu-operator.override.yaml` file with the following content: +To prevent unprivileged containers from bypassing the Kubernetes Device Plugin API, configure the GPU operator +using the following `gpu-operator.override.yaml` file: ```yaml toolkit: @@ -53,7 +53,7 @@ helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace ## Fractional GPU Support -For support with fractional GPUs, refer to the dedicated guides: +To enable fractional GPU allocation or manage mixed GPU configurations, refer to the following guides: * [ClearML Dynamic MIG Operator](../fractional_gpus/cdmo.md) (CDMO) – Dynamically configures MIG GPUs on supported devices. * [ClearML Enterprise Fractional GPU Injector](../fractional_gpus/cfgi.md) (CFGI) – Enables fractional (non-MIG) GPU allocation for better hardware utilization and workload distribution in Kubernetes. 
diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md index d033464b..026a39d6 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/multi_node_training.md @@ -2,7 +2,8 @@ title: Multi-Node Training --- -The ClearML Enterprise Agent supports horizontal multi-node training--running a single Task across multiple Pods on different nodes. +The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single Task to run across multiple pods +on different nodes. Below is a configuration example using `clearml-agent-values.override.yaml`: diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md index 1188fbcf..80a57359 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/presign_service.md @@ -7,16 +7,16 @@ users, enabling direct access to cloud-hosted data (e.g., S3) without exposing c ## Prerequisites -- A ClearML Enterprise server is up and running. -- Generate `` and `` credentials in the ClearML Server. The easiest way is via the ClearML UI - (**Settings > Workspace > App Credentials > Create new credentials**). +- A ClearML Enterprise Server is up and running. +- API credentials (`` and ``) generated via + the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials). :::note Make sure these credentials belong to an admin user or a service user with admin privileges. ::: - The worker environment must be able to access the ClearML Server over the same network. 
- +- Token to access `clearml-enterprise` Helm chart repo ## Installation @@ -50,7 +50,7 @@ ingress: ### Deploy the Helm Chart -Install the `clearml-presign-service` helm chart in the same namespace as the ClearML Enterprise server: +Install the `clearml-presign-service` Helm chart in the same namespace as the ClearML Enterprise server: ```bash helm install -n clearml clearml-presign-service clearml-enterprise/clearml-presign-service -f presign-service.override.yaml diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md index a63aacbb..741f5636 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/self_signed_certificates.md @@ -2,10 +2,8 @@ title: ClearML Tenant with Self Signed Certificates --- -This guide covers the configuration to support SSL Custom certificates for the following components: - -- ClearML Enterprise AI Application Gateway -- ClearML Enterprise Agent +This guide covers how to configure the [AI Application Gateway](#ai-application-gateway) and [ClearML Agent](#clearml-agent) +to use self-signed or custom SSL certificates. 
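Conceptually, a custom CA bundle is just the individual PEM certificate blocks concatenated in order, exactly as they appear in a standard `ca-certificates.crt`. A minimal sketch (`build_ca_bundle` is a hypothetical helper and the certificate strings are stand-ins, not real certificates):

```python
def build_ca_bundle(pem_certs: list[str]) -> str:
    """Concatenate individual PEM certificates into a single
    ca-certificates.crt-style bundle, one block after another."""
    blocks = [cert.strip() for cert in pem_certs if cert.strip()]
    return "\n".join(blocks) + "\n"


# Stand-in certificates; real ones come from your own CA.
root_ca = "-----BEGIN CERTIFICATE-----\nMIIB...root...\n-----END CERTIFICATE-----"
intermediate = "-----BEGIN CERTIFICATE-----\nMIIB...intermediate...\n-----END CERTIFICATE-----"

bundle = build_ca_bundle([root_ca, intermediate])
print(bundle.count("BEGIN CERTIFICATE"))  # 2
```

The resulting string is what you would paste under `customCertificates` when replacing the whole bundle.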
## AI Application Gateway @@ -25,15 +23,15 @@ customCertificates: -----END CERTIFICATE----- ``` -In this section, there are two options: +You have two configuration options: -- [**Replace**](#replace-the-whole-ca-certificatescrt-file) the entire `ca-certificates.crt` file +- [**Replace**](#replace-entire-ca-certificatescrt-file) the entire `ca-certificates.crt` file - [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt) extra certificates to the existing `ca-certificates.crt` ### Replace Entire `ca-certificates.crt` File -To replace the whole ca-bundle, you can attach a concatenation of all your valid CA in a `pem` format as +To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format as they are stored in a standard `ca-certificates.crt`. ```yaml @@ -55,7 +53,7 @@ customCertificates: ### Append Extra Certificates to the Existing `ca-certificates.crt` -You can add certificates to the existing CA bundle. Ensure each certificate has a unique `alias`. +You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`. 
```yaml # -- Custom certificates @@ -82,9 +80,9 @@ To apply the changes, run the update command: helm upgrade -i -n clearml-enterprise/clearml-enterprise-app-gateway --version -f clearml-app-gateway-values.override.yaml ``` -## ClearML Enterprise Agent +## ClearML Agent -For the Agent, configure certificates in the `clearml-agent-values.override.yaml` file: +For the ClearML Agent, configure certificates in the `clearml-agent-values.override.yaml` file: ```yaml # -- Custom certificates @@ -100,17 +98,18 @@ customCertificates: -----END CERTIFICATE----- ``` -In the section, there are two options: +You have two configuration options: -- [**Replace**](#replace-the-whole-ca-certificatescrt-file) the entire `ca-certificates.crt` file -- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt) extra certificates to the existing `ca-certificates.crt` +- [**Replace**](#replace-entire-ca-certificatescrt-file-1) the entire `ca-certificates.crt` file +- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt-1) extra certificates to the existing `ca-certificates.crt` ### Replace Entire `ca-certificates.crt` File -If you need to replace the whole ca-bundle you can attach a concatenation of all your valid CA in a `pem` format like +To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format as they are stored in a standard `ca-certificates.crt`. + ```yaml # -- Custom certificates customCertificates: @@ -130,7 +129,7 @@ customCertificates: ### Append Extra Certificates to the Existing `ca-certificates.crt` -You can add certificates to the existing CA bundle. Ensure each certificate has a unique `alias`. +You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`. 
```yaml # -- Custom certificates @@ -151,7 +150,7 @@ customCertificates: ### Add Certificates to Task Pods -If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into Pods: +If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into pods: ```yaml agentk8sglue: @@ -195,7 +194,7 @@ Their names are usually prefixed with the Helm release name, so adjust according ### Apply Changes -Applying the changes by running the the update command: +Apply the changes by running the update command: ``` bash helm upgrade -i -n clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml diff --git a/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md b/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md index 5c4a7fc7..6e9179e5 100644 --- a/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md +++ b/docs/deploying_clearml/enterprise_deploy/extra_configs/sso_login.md @@ -3,7 +3,8 @@ title: SSO (Identity Provider) Setup --- ClearML Enterprise Server supports various Single Sign-On (SSO) identity providers. -SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and applied to the `apiserver` component. +SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and is applied to the +`apiserver` component. The following are configuration examples for commonly used providers. Other supported systems include: * Auth0 @@ -11,7 +12,7 @@ The following are configuration examples for commonly used providers. Other supp * Okta * Azure AD * Google -* and AWS Cognito +* AWS Cognito ## Auth0 @@ -56,17 +57,17 @@ apiserver: To map Keycloak groups into the ClearML user's SSO token: 1. Go to the **Client Scopes** tab. -1. Click on the first row `-dedicated`. -1. 
Click **Add Mapper > By configuration > Group membership** -1. In the dialog: - * select the **Name** "groups" +1. Click on the `-dedicated` scope. +1. Click **Add Mapper > By Configuration > Group Membership** +1. Configure the mapper: + * Select the **Name** "groups" * Set **Token Claim Name** "groups" * Uncheck the **Full group path** * Save the mapper. To verify: -1. Return to **Client Details > Client scope** tab. -1. Go to the Evaluate sub-tab and select a user who has any group memberships. -1. Go to **Generated ID token** and then to **Generated User Info**. -1Inspect that in both cases you can see the group's claim in the displayed user data. +1. Go to the **Client Details > Client scope** tab. +1. Go to the **Evaluate** sub-tab and select a user with any group memberships. +1. Go to **Generated ID Token** and then to **Generated User Info**. +1. Inspect that in both cases you can see the group's claim in the displayed user data. diff --git a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md index 1aa62c91..138792fe 100644 --- a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md +++ b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo.md @@ -2,14 +2,12 @@ title: ClearML Dynamic MIG Operator (CDMO) --- -The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations. +The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations. ## Installation ### Requirements -* Install the official NVIDIA `gpu-operator` using Helm with one of the following configurations. - * Add and update the Nvidia Helm repo: ```bash @@ -46,7 +44,7 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations. 
value: all
 ```

-* Install the official NVIDIA `gpu-operator` using Helm with the previous configuration:
+* Install the NVIDIA `gpu-operator` using Helm with the previous configuration:

   ```bash
   helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
   ```

 ### Installing CDMO

-* Create a `cdmo-values.override.yaml` file with the following content:
+1. Create a `cdmo-values.override.yaml` file with the following content:

   ```yaml
   imageCredentials:
     password: ""
   ```

-* Install the CDMO Helm Chart using the previous override file:
+1. Install the CDMO Helm Chart using the previous override file:

   ```bash
   helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
   ```

-* Enable the NVIDIA MIG support on your cluster by running the following command on all nodes with a MIG-supported GPU
+1. Enable NVIDIA MIG support on your cluster by running the following command on all nodes with a MIG-supported GPU
   (run it for each GPU `` on the host):

   ```bash
   nvidia-smi -mig 1
   ```

 :::note notes
-* The node reboot may be required if the command output indicates so.
+   :::note notes
+   * A node reboot may be required if the command output indicates so.
+
+   * For convenience, this command can be run from within the `nvidia-device-plugin-daemonset` pod running on the related node.
+   :::

-* For convenience, this command can be issued from inside the `nvidia-device-plugin-daemonset` pod running on the related node.
-:::
-
-* Any MIG-enabled GPU node `` from the last point must be labeled accordingly as follows:
+1. Label all MIG-enabled GPU nodes (``) from the previous step:

   ```bash
   kubectl label nodes "" cdmo.clear.ml/gpu-partitioning=mig
   ```
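Since the MIG-enable command is run once per GPU on each host, a small script can generate the per-GPU commands from `nvidia-smi -L` output. This is an illustrative sketch only: `gpu_indices` is a hypothetical helper, the sample output is made up, and the `-i <index>` form of `nvidia-smi` is assumed to target a single GPU.

```python
import re


def gpu_indices(nvidia_smi_list_output: str) -> list[int]:
    """Extract GPU indices from `nvidia-smi -L`-style output lines."""
    return [int(m.group(1))
            for m in re.finditer(r"^GPU (\d+):", nvidia_smi_list_output, re.MULTILINE)]


# Hypothetical sample output of `nvidia-smi -L` on a two-GPU node.
sample = """GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-aaaa)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-bbbb)"""

commands = [f"nvidia-smi -i {i} -mig 1" for i in gpu_indices(sample)]
print(commands)  # ['nvidia-smi -i 0 -mig 1', 'nvidia-smi -i 1 -mig 1']
```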
 ## Disabling MIGs

-To disable MIG, follow these steps:
+To disable MIG mode and restore standard full-GPU access:

 1. Ensure no running workflows are using GPUs on the target node(s).

   nvidia-smi -mig 0
   ```

-4. Edit the `gpu-operator.override.yaml` file to have a standard configuration for full GPUs, and upgrade the `gpu-operator`:
+4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access, and upgrade the `gpu-operator`:

   ```yaml
   toolkit:
diff --git a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo_cfgi_same_cluster.md b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo_cfgi_same_cluster.md
index b5cbf311..db61dd7f 100644
--- a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo_cfgi_same_cluster.md
+++ b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cdmo_cfgi_same_cluster.md
@@ -1,7 +1,8 @@
 ---
-title: Install CDMO and CFGI on the same Cluster
+title: Install CDMO and CFGI on the Same Cluster
 ---

+You can install both CDMO (ClearML Dynamic MIG Operator) and CFGI (ClearML Fractional GPU Injector) on a shared Kubernetes cluster.
 In clusters with multiple nodes and varying GPU types, the `gpu-operator` can be used to manage different device
 configurations and fractioning modes.

@@ -11,7 +12,7 @@ The NVIDIA `gpu-operator` supports defining multiple configurations for the Devi
 The following example YAML defines two configurations: "mig" and "ts" (time-slicing).

-``` yaml
+```yaml
 migManager:
   enabled: false
 mig:
@@ -69,24 +70,15 @@ devicePlugin:
 ## Applying Configuration to Nodes

-To activate a configuration, label the Kubernetes node accordingly. After a node is labeled,
-the NVIDIA `device-plugin` will automatically reload the new configuration.
+Label each Kubernetes node accordingly to activate a specific GPU mode: -Example usage: - * Apply the `mig` (MIG mode) config: - ``` bash - kubectl label node nvidia.com/device-plugin.config=mig - ``` +|Mode| Label command| +|----|-----| +| `mig` | `kubectl label node nvidia.com/device-plugin.config=mig` | +| `ts` (time slicing) | `kubectl label node nvidia.com/device-plugin.config=ts` | +| Standard full-GPU access | `kubectl label node nvidia.com/device-plugin.config=all-disabled` | - * Apply the `ts` (time slicing) config: - ``` bash - kubectl label node nvidia.com/device-plugin.config=ts - ``` - - * Apply the `all-disabled` (standard full GPU access) config: - ``` bash - kubectl label node nvidia.com/device-plugin.config=all-disabled - ``` +After a node is labeled, the NVIDIA `device-plugin` will automatically reload the new configuration. ## Installing CDMO and CFGI @@ -97,22 +89,26 @@ and [CFGI](cfgi.md). ### Time Slicing -To switch between time-slicing and full GPU access, update the node label using the `--overwrite` flag: +To disable time-slicing and use full GPU access, update the node label using the `--overwrite` flag: + +```bash +kubectl label node nvidia.com/device-plugin.config=all-disabled --overwrite +``` ### MIG To disable MIG mode: -1. Ensure there are no more running workflows requesting any form of GPU on the node(s) before re-configuring it. -2. Remove the CDMO label from the target node(s) to disable the dynamic MIG reconfiguration. +1. Ensure there are no more running workflows requesting any form of GPU on the node(s). +2. Remove the CDMO label from the target node(s). - ``` bash + ```bash kubectl label nodes "cdmo.clear.ml/gpu-partitioning-" ``` -3. Execute a shell in the `device-plugin-daemonset` Pod instance running on the target node(s) and execute the following commands: +3. 
Execute a shell in the `device-plugin-daemonset` pod instance running on the target node(s) and execute the following commands: - ``` bash + ```bash nvidia-smi mig -dci nvidia-smi mig -dgi @@ -120,8 +116,8 @@ To disable MIG mode: nvidia-smi -mig 0 ``` -4. Relabel the target node to disable MIG: +4. Label the node to use standard (non-MIG) GPU mode: - ``` bash + ```bash kubectl label node nvidia.com/device-plugin.config=all-disabled --overwrite ``` \ No newline at end of file diff --git a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md index 2cfe16a4..749c02d6 100644 --- a/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md +++ b/docs/deploying_clearml/enterprise_deploy/fractional_gpus/cfgi.md @@ -2,34 +2,36 @@ title: ClearML Fractional GPU Injector (CFGI) --- -The **ClearML Enterprise Fractional GPU Injector** (CFGI) allows AI workloads to run on Kubernetes using non-MIG GPU -fractions, optimizing both hardware utilization and performance. +The **ClearML Enterprise Fractional GPU Injector** (CFGI) allows AI workloads to utilize fractional (non-MIG) GPU slices +on Kubernetes clusters, maximizing hardware efficiency and performance. ## Installation ### Add the Local ClearML Helm Repository -``` bash +```bash helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username --password helm repo update ``` ### Requirements -* Install the official NVIDIA `gpu-operator` using Helm with one of the following configurations. -* The number of slices must be 8. 
+* Install the NVIDIA `gpu-operator` using Helm +* Set the number of GPU slices to 8 * Add and update the Nvidia Helm repo: - ``` bash + ```bash helm repo add nvidia https://nvidia.github.io/gpu-operator helm repo update ``` + +* Credentials for the ClearML Enterprise DockerHub repository -#### GPU Operator Configuration +### GPU Operator Configuration -##### For CFGI Version >= 1.3.0 +#### For CFGI Version >= 1.3.0 -1. Create a docker-registry secret named `clearml-dockerhub-access` in the `gpu-operator` Namespace, making sure to replace your ``: +1. Create a Docker Registry secret named `clearml-dockerhub-access` in the `gpu-operator` namespace. Make sure to replace `` with your token. ```bash kubectl create secret -n gpu-operator docker-registry clearml-dockerhub-access \ @@ -101,11 +103,11 @@ devicePlugin: replicas: 8 ``` -##### For CFGI version < 1.3.0 (Legacy GPU Operator) +#### For CFGI version < 1.3.0 (Legacy) Create a `gpu-operator.override.yaml` file: -``` yaml +```yaml toolkit: env: - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED @@ -144,26 +146,26 @@ devicePlugin: replicas: 8 ``` -### Install +### Install GPU Operator and CFGI -Install the nvidia `gpu-operator` using the previously created `gpu-operator.override.yaml` override file: +1. Install the NVIDIA `gpu-operator` using the previously created `gpu-operator.override.yaml` file: -```bash -helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml -``` + ```bash + helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml + ``` -Create a `cfgi-values.override.yaml` file with the following content: +1. Create a `cfgi-values.override.yaml` file with the following content: -```yaml -imageCredentials: - password: "" -``` + ```yaml + imageCredentials: + password: "" + ``` -Install the CFGI Helm Chart using the previous override file: +1. 
Install the CFGI Helm Chart using the previous override file:

   ```bash
   helm install -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector --create-namespace -f cfgi-values.override.yaml
   ```

 ## Usage

@@ -187,9 +189,9 @@ Valid values for `""` include:
 * "0.875"
 * Integer representation of GPUs such as `1.000`, `2`, `2.0`, etc.

-### ClearML Enterprise Agent Configuration
+### ClearML Agent Configuration

-To run ClearML jobs that request specific GPU fractions, configure the queues in your `clearml-agent-values.override.yaml` file.
+To run ClearML jobs with fractional GPU allocation, configure your queues accordingly in your `clearml-agent-values.override.yaml` file.
 Each queue should include a `templateOverride` that sets the `clearml-injector/fraction` label, which determines the
 fraction of a GPU to allocate (e.g., "0.500" for half a GPU).

@@ -259,16 +261,16 @@ agentk8sglue:
             nvidia.com/gpu: 1
 ```

-## Upgrading Chart
+## Upgrading CFGI Chart

-To upgrade to the latest version of this chart:
+To upgrade to the latest chart version:

 ```bash
 helm repo update
 helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector
 ```

-To apply changes to values on an existing installation:
+To apply new values to an existing installation:

 ```bash
 helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector -f cfgi-values.override.yaml
 ```
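The valid fraction values listed above (eighths of a GPU between "0.125" and "0.875", or whole-GPU counts such as `2` or `2.0`) can be checked with a quick validator. This is an illustrative sketch based only on that list; `valid_gpu_fraction` is a hypothetical helper, and the injector's actual validation may differ.

```python
from decimal import Decimal


def valid_gpu_fraction(value: str) -> bool:
    """Check a clearml-injector/fraction label value: whole GPUs
    (e.g. "2", "2.0") or eighths of a GPU between 0.125 and 0.875."""
    try:
        d = Decimal(value)
    except ArithmeticError:
        return False
    if d <= 0:
        return False
    if d == d.to_integral_value():
        return True  # whole-GPU counts like "1.000", "2", "2.0"
    eighths = d * 8
    return d < 1 and eighths == eighths.to_integral_value()


print(valid_gpu_fraction("0.500"))  # True
print(valid_gpu_fraction("0.300"))  # False
print(valid_gpu_fraction("2.0"))    # True
```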