mirror of https://github.com/clearml/clearml-docs
synced 2025-06-11 00:58:02 +00:00

Edits

This commit is contained in:
parent d94d777e55
commit aaa3851de3
@ -6,8 +6,8 @@ The ClearML Agent enables scheduling and executing distributed experiments on a
## Prerequisites

- A running [ClearML Enterprise Server](k8s.md)
- API credentials (`<ACCESS_KEY>` and `<SECRET_KEY>`) generated via
  the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).

:::note
@ -15,7 +15,7 @@ The ClearML Agent enables scheduling and executing distributed experiments on a
:::

- The worker environment must be able to access the ClearML Server over the same network.
- Token to access the `clearml-enterprise` Helm chart repo

## Installation

@ -36,9 +36,9 @@ helm repo update
Create a `clearml-agent-values.override.yaml` file with the following content:

:::note
Replace the `<ACCESS_KEY>` and `<SECRET_KEY>` with the API credentials you generated earlier.
Set the `<api|file|web>ServerUrlReference` fields to match your ClearML Server URLs.
:::

```yaml
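# NOTE: the full override is elided in this diff. A minimal sketch follows;
# the key names below are assumptions inferred from the
# <api|file|web>ServerUrlReference fields mentioned above, not confirmed
# chart values -- check `helm show values` for the authoritative schema.
clearml:
  agentk8sglueKey: "<ACCESS_KEY>"
  agentk8sglueSecret: "<SECRET_KEY>"
  apiServerUrlReference: "https://api.<BASE_DOMAIN>"
  fileServerUrlReference: "https://files.<BASE_DOMAIN>"
  webServerUrlReference: "https://app.<BASE_DOMAIN>"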
@ -60,7 +60,7 @@ agentk8sglue:

### Install the Chart

Install the ClearML Enterprise Agent Helm chart:

```bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
@ -68,7 +68,7 @@ helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-e

## Additional Configuration Options

To view available configuration options for the Helm chart, run the following commands:

```bash
helm show readme clearml-enterprise/clearml-enterprise-agent
@ -76,7 +76,7 @@ helm show readme clearml-enterprise/clearml-enterprise-agent
helm show values clearml-enterprise/clearml-enterprise-agent
```

### Reporting GPU Availability to the Orchestration Dashboard

To show GPU availability in the [Orchestration Dashboard](../../webapp/webapp_orchestration_dash.md), explicitly set the number of GPUs:

@ -88,25 +88,22 @@ agentk8sglue:

### Queues

The ClearML Agent monitors [ClearML queues](../../fundamentals/agents_and_queues.md) and pulls tasks that are scheduled for execution.

A single agent can monitor multiple queues. By default, all queues share a base pod template (`agentk8sglue.basePodTemplate`)
used when launching tasks on Kubernetes after they are pulled from the queue.

Each queue can be configured to override the base pod template with its own settings via a `templateOverrides` queue template.
This way, queue definitions can be tailored to different use cases.

The following are a few examples of agent queue templates:

#### Example: GPU Queues

To support GPU queues, you must deploy the NVIDIA GPU Operator on your Kubernetes cluster. For more information, see [GPU Operator](extra_configs/gpu_operator.md).

```yaml
agentk8sglue:
  createQueues: true
  queues:
@ -122,8 +119,9 @@ agentk8sglue:
nvidia.com/gpu: 2
```

#### Example: Custom Pod Template per Queue

This example demonstrates how to override the base pod template definitions on a per-queue basis.
In this example:

- The `red` queue inherits both the label `team=red` and the 1Gi memory limit from the `basePodTemplate` section.
@ -167,5 +165,5 @@ agentk8sglue:

## Next Steps

Once the agent is up and running, proceed with deploying the [ClearML Enterprise App Gateway](appgw_install_k8s.md).

@ -2,18 +2,15 @@
title: Dynamically Edit Task Pod Template
---

ClearML Agent allows you to inject custom Python code to dynamically modify the Kubernetes Pod template before applying it.

## Agent Configuration

The `CLEARML_K8S_GLUE_TEMPLATE_MODULE` environment variable defines the Python module and function inside that
module to be invoked by the agent before applying a task pod template.

The agent will run this code in its own context, pass arguments (including the actual template) to the function, and use
the returned template to create the final Task Pod in Kubernetes.
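
As a minimal sketch, such a module might look like the following (the argument names `template`, `queue`, and `task_id` are assumptions for illustration, not the documented interface):

```python
# my_template_module.py -- hypothetical module referenced by
# CLEARML_K8S_GLUE_TEMPLATE_MODULE. Argument names are assumptions.

def modify_template(template=None, queue=None, task_id=None, *args, **kwargs):
    # Tag the pod so it can be traced back to the originating queue.
    labels = template.setdefault("metadata", {}).setdefault("labels", {})
    labels["clearml-queue"] = str(queue)
    # Return the (possibly modified) template for the agent to apply.
    return template
```
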
Arguments passed to the function include:
@ -60,13 +57,13 @@ agentk8sglue:
```

:::note notes
* Always include `*args, **kwargs` at the end of the function's argument list and only use keyword arguments.
  This is needed to maintain backward compatibility.

* Custom code modules can be included as a file in the pod's container, and the environment variable can be used to
  point to the file and entry point.

* When defining a custom code module, by default the agent will start watching pods in all namespaces
  across the cluster. If you do not intend to give a `ClusterRole` permission, make sure to set the
  `CLEARML_K8S_GLUE_MONITOR_ALL_NAMESPACES` env to `"0"` to prevent the agent from trying to list pods in all namespaces.
  Set it to `"1"` instead if namespace-related changes are needed in the code.
@ -80,13 +77,13 @@ agentk8sglue:

To customize the bash startup scripts instead of the pod spec, use:

```yaml
agentk8sglue:
  # -- Custom Bash script for the Agent pod run by the Glue Agent
  customBashScript: ""
  # -- Custom Bash script for the Task Pods run by the Glue Agent
  containerCustomBashScript: ""
```

## Examples

@ -167,7 +164,7 @@ agentk8sglue:

### Example: Bind PVC Resource to Task Pod

In this example, a PVC is created and attached to every pod created from a dedicated queue, and is deleted afterwards.

Key points:

@ -2,10 +2,12 @@
title: Basic Deployment - Suggested GPU Operator Values
---

This guide provides recommended configuration values for deploying the NVIDIA GPU Operator alongside ClearML Enterprise.

## Add the Helm Repo Locally

Add the NVIDIA GPU Operator Helm repository:

```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
```
@ -17,10 +19,8 @@ helm repo update

## Installation

To prevent unprivileged containers from bypassing the Kubernetes Device Plugin API, configure the GPU operator
using the following `gpu-operator.override.yaml` file:

```yaml
toolkit:
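  # The remaining values are elided in this diff. Based on the legacy CFGI
  # example later in these docs, the relevant settings are assumed to be:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"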
@ -53,7 +53,7 @@ helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace

## Fractional GPU Support

To enable fractional GPU allocation or manage mixed GPU configurations, refer to the following guides:
* [ClearML Dynamic MIG Operator](../fractional_gpus/cdmo.md) (CDMO) – Dynamically configures MIG GPUs on supported devices.
* [ClearML Enterprise Fractional GPU Injector](../fractional_gpus/cfgi.md) (CFGI) – Enables fractional (non-MIG) GPU
  allocation for better hardware utilization and workload distribution in Kubernetes.

@ -2,7 +2,8 @@
title: Multi-Node Training
---

The ClearML Enterprise Agent supports horizontal multi-node training, allowing a single Task to run across multiple pods
on different nodes.

Below is a configuration example using `clearml-agent-values.override.yaml`:

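A minimal sketch of such an override follows; the `queueSettings.multiNode` field name is an assumption for illustration, not a confirmed chart value:

```yaml
agentk8sglue:
  createQueues: true
  queues:
    multi-node-queue:            # placeholder queue name
      queueSettings:
        multiNode: 2             # assumed: split the task across 2 pods
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1    # one GPU per pod replica
```
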
@ -7,16 +7,16 @@ users, enabling direct access to cloud-hosted data (e.g., S3) without exposing c

## Prerequisites

- A ClearML Enterprise Server is up and running.
- API credentials (`<ACCESS_KEY>` and `<SECRET_KEY>`) generated via
  the ClearML UI (**Settings > Workspace > API Credentials > Create new credentials**). For more information, see [ClearML API Credentials](../../../webapp/settings/webapp_settings_profile.md#clearml-api-credentials).

:::note
Make sure these credentials belong to an admin user or a service user with admin privileges.
:::

- The worker environment must be able to access the ClearML Server over the same network.
- Token to access the `clearml-enterprise` Helm chart repo

## Installation

@ -50,7 +50,7 @@ ingress:

### Deploy the Helm Chart

Install the `clearml-presign-service` Helm chart in the same namespace as the ClearML Enterprise server:

```bash
helm install -n clearml clearml-presign-service clearml-enterprise/clearml-presign-service -f presign-service.override.yaml
@ -2,10 +2,8 @@
title: ClearML Tenant with Self Signed Certificates
---

This guide covers how to configure the [AI Application Gateway](#ai-application-gateway) and [ClearML Agent](#clearml-agent)
to use self-signed or custom SSL certificates.

## AI Application Gateway

@ -25,15 +23,15 @@ customCertificates:
-----END CERTIFICATE-----
```

You have two configuration options:

- [**Replace**](#replace-entire-ca-certificatescrt-file) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt) extra certificates to the existing `ca-certificates.crt`

### Replace Entire `ca-certificates.crt` File

To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format as
they are stored in a standard `ca-certificates.crt`.

```yaml
@ -55,7 +53,7 @@ customCertificates:

### Append Extra Certificates to the Existing `ca-certificates.crt`

You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.

```yaml
# -- Custom certificates
@ -82,9 +80,9 @@ To apply the changes, run the update command:
helm upgrade -i <RELEASE_NAME> -n <WORKLOAD_NAMESPACE> clearml-enterprise/clearml-enterprise-app-gateway --version <CHART_VERSION> -f clearml-app-gateway-values.override.yaml
```

## ClearML Agent

For the ClearML Agent, configure certificates in the `clearml-agent-values.override.yaml` file:

```yaml
# -- Custom certificates
@ -100,17 +98,18 @@ customCertificates:
-----END CERTIFICATE-----
```

You have two configuration options:

- [**Replace**](#replace-entire-ca-certificatescrt-file-1) the entire `ca-certificates.crt` file
- [**Append**](#append-extra-certificates-to-the-existing-ca-certificatescrt-1) extra certificates to the existing `ca-certificates.crt`

### Replace Entire `ca-certificates.crt` File

To replace the whole ca-bundle, provide a concatenated list of all trusted CA certificates in `pem` format as
they are stored in a standard `ca-certificates.crt`.

```yaml
# -- Custom certificates
customCertificates:
@ -130,7 +129,7 @@ customCertificates:

### Append Extra Certificates to the Existing `ca-certificates.crt`

You can add certificates to the existing CA bundle. Each certificate must have a unique `alias`.

```yaml
# -- Custom certificates
@ -151,7 +150,7 @@ customCertificates:

### Add Certificates to Task Pods

If your workloads need access to these certificates (e.g., for HTTPS requests), configure the agent to inject them into pods:

```yaml
agentk8sglue:
@ -195,7 +194,7 @@ Their names are usually prefixed with the Helm release name, so adjust according

### Apply Changes

Apply the changes by running the update command:

```bash
helm upgrade -i -n <WORKER_NAMESPACE> clearml-agent clearml-enterprise/clearml-enterprise-agent --create-namespace -f clearml-agent-values.override.yaml
@ -3,7 +3,8 @@ title: SSO (Identity Provider) Setup
---

ClearML Enterprise Server supports various Single Sign-On (SSO) identity providers.
SSO configuration is managed via environment variables in your `clearml-values.override.yaml` file and is applied to the
`apiserver` component.

The following are configuration examples for commonly used providers. Other supported systems include:
* Auth0
@ -11,7 +12,7 @@ The following are configuration examples for commonly used providers. Other supp
* Okta
* Azure AD
* Google
* AWS Cognito

## Auth0

@ -56,17 +57,17 @@ apiserver:

To map Keycloak groups into the ClearML user's SSO token:

1. Go to the **Client Scopes** tab.
1. Click on the `<clearml client>-dedicated` scope.
1. Click **Add Mapper > By Configuration > Group Membership**.
1. Configure the mapper:
   * Set the **Name** to "groups".
   * Set the **Token Claim Name** to "groups".
   * Uncheck **Full group path**.
   * Save the mapper.

To verify:

1. Go to the **Client Details > Client scope** tab.
1. Go to the **Evaluate** sub-tab and select a user with any group memberships.
1. Go to **Generated ID Token** and then to **Generated User Info**.
1. Verify that the groups claim appears in both views of the displayed user data.
@ -2,14 +2,12 @@
title: ClearML Dynamic MIG Operator (CDMO)
---

The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG (Multi-Instance GPU) configurations.

## Installation

### Requirements

* Add and update the Nvidia Helm repo:

```bash
@ -46,7 +44,7 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.
value: all
```

* Install the NVIDIA `gpu-operator` using Helm with the previous configuration:

```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
@ -54,33 +52,33 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.

### Installing CDMO

1. Create a `cdmo-values.override.yaml` file with the following content:

```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
```

1. Install the CDMO Helm Chart using the previous override file:

```bash
helm install -n cdmo cdmo clearml-enterprise/clearml-dynamic-mig-operator --create-namespace -f cdmo-values.override.yaml
```

1. Enable the NVIDIA MIG support on your cluster by running the following command on all nodes with a MIG-supported GPU
   (run it for each GPU `<GPU_ID>` on the host):

```bash
nvidia-smi -i <GPU_ID> -mig 1
```

:::note notes
* A node reboot may be required if the command output indicates so.

* For convenience, this command can be run from within the `nvidia-device-plugin-daemonset` pod running on the related node.
:::

1. Label each MIG-enabled GPU node `<NODE_NAME>` from the previous step:

```bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning=mig"
@ -88,7 +86,7 @@ The ClearML Dynamic MIG Operator (CDMO) enables dynamic MIG GPU configurations.

## Disabling MIGs

To disable MIG mode and restore standard full-GPU access:

1. Ensure no running workflows are using GPUs on the target node(s).

@ -108,7 +106,7 @@ To disable MIG, follow these steps:
nvidia-smi -mig 0
```

4. Edit the `gpu-operator.override.yaml` file to restore full-GPU access, and upgrade the `gpu-operator`:

```yaml
toolkit:
@ -1,7 +1,8 @@
---
title: Install CDMO and CFGI on the Same Cluster
---

You can install both CDMO (ClearML Dynamic MIG Operator) and CFGI (ClearML Fractional GPU Injector) on a shared Kubernetes cluster.
In clusters with multiple nodes and varying GPU types, the `gpu-operator` can be used to manage different device configurations
and fractioning modes.

@ -11,7 +12,7 @@ The NVIDIA `gpu-operator` supports defining multiple configurations for the Devi

The following example YAML defines two configurations: "mig" and "ts" (time-slicing).

```yaml
migManager:
  enabled: false
mig:
@ -69,24 +70,15 @@ devicePlugin:

## Applying Configuration to Nodes

Label each Kubernetes node accordingly to activate a specific GPU mode:

| Mode | Label command |
|------|---------------|
| `mig` | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=mig` |
| `ts` (time slicing) | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=ts` |
| Standard full-GPU access | `kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled` |

After a node is labeled, the NVIDIA `device-plugin` will automatically reload the new configuration.

## Installing CDMO and CFGI

@ -97,22 +89,26 @@ and [CFGI](cfgi.md).

### Time Slicing

To disable time-slicing and use full GPU access, update the node label using the `--overwrite` flag:

```bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
```

### MIG

To disable MIG mode:

1. Ensure there are no more running workflows requesting any form of GPU on the node(s).
2. Remove the CDMO label from the target node(s).

```bash
kubectl label nodes <NODE_NAME> "cdmo.clear.ml/gpu-partitioning-"
```

3. Execute a shell in the `device-plugin-daemonset` pod instance running on the target node(s) and execute the following commands:

```bash
nvidia-smi mig -dci

nvidia-smi mig -dgi
@ -120,8 +116,8 @@ To disable MIG mode:
nvidia-smi -mig 0
```

4. Label the node to use standard (non-MIG) GPU mode:

```bash
kubectl label node <NODE_NAME> nvidia.com/device-plugin.config=all-disabled --overwrite
```
@ -2,34 +2,36 @@
title: ClearML Fractional GPU Injector (CFGI)
---

The **ClearML Enterprise Fractional GPU Injector** (CFGI) allows AI workloads to utilize fractional (non-MIG) GPU slices
on Kubernetes clusters, maximizing hardware efficiency and performance.

## Installation

### Add the Local ClearML Helm Repository

```bash
helm repo add clearml-enterprise https://raw.githubusercontent.com/clearml/clearml-enterprise-helm-charts/gh-pages --username <GITHUB_TOKEN> --password <GITHUB_TOKEN>
helm repo update
```

### Requirements

* Install the NVIDIA `gpu-operator` using Helm
* Set the number of GPU slices to 8
* Add and update the Nvidia Helm repo:

```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```

* Credentials for the ClearML Enterprise DockerHub repository

### GPU Operator Configuration

#### For CFGI Version >= 1.3.0

1. Create a Docker Registry secret named `clearml-dockerhub-access` in the `gpu-operator` namespace. Make sure to replace `<CLEARML_DOCKERHUB_TOKEN>` with your token.

```bash
kubectl create secret -n gpu-operator docker-registry clearml-dockerhub-access \
@ -101,11 +103,11 @@ devicePlugin:
replicas: 8
```

#### For CFGI version < 1.3.0 (Legacy)

Create a `gpu-operator.override.yaml` file:

```yaml
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
@ -144,26 +146,26 @@ devicePlugin:
replicas: 8
```

### Install GPU Operator and CFGI

1. Install the NVIDIA `gpu-operator` using the previously created `gpu-operator.override.yaml` file:

```bash
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --create-namespace -f gpu-operator.override.yaml
```

1. Create a `cfgi-values.override.yaml` file with the following content:

```yaml
imageCredentials:
  password: "<CLEARML_DOCKERHUB_TOKEN>"
```

1. Install the CFGI Helm Chart using the previous override file:

```bash
helm install -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector --create-namespace -f cfgi-values.override.yaml
```

## Usage

@ -187,9 +189,9 @@ Valid values for `"<GPU_FRACTION_VALUE>"` include:
* "0.875"
* Integer representation of GPUs such as `1.000`, `2`, `2.0`, etc.

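For illustration, requesting half a GPU for a standalone pod might look like the following sketch (the pod name and image are placeholders; only the `clearml-injector/fraction` label is taken from these docs):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fraction-demo            # placeholder name
  labels:
    # Ask the injector to allocate 50% of a GPU to this pod
    clearml-injector/fraction: "0.500"
spec:
  containers:
    - name: app                  # placeholder container
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
```
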
### ClearML Agent Configuration

To run ClearML jobs with fractional GPU allocation, configure the queues accordingly in your `clearml-agent-values.override.yaml` file.
Each queue should include a `templateOverrides` that sets the `clearml-injector/fraction` label, which determines the
fraction of a GPU to allocate (e.g., "0.500" for half a GPU).

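A minimal sketch of such a queue definition (the queue name is a placeholder; the structure follows the queue examples earlier in these docs):

```yaml
agentk8sglue:
  createQueues: true
  queues:
    gpu-half:                    # placeholder queue name
      templateOverrides:
        labels:
          # Allocate half a GPU to tasks pulled from this queue
          clearml-injector/fraction: "0.500"
```
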
@ -259,16 +261,16 @@ agentk8sglue:
nvidia.com/gpu: 1
```

## Upgrading CFGI Chart

To upgrade to the latest chart version:

```bash
helm repo update
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector
```

To apply new values to an existing installation:

```bash
helm upgrade -n cfgi cfgi clearml-enterprise/clearml-fractional-gpu-injector -f cfgi-values.override.yaml
```